Commit 3f48391d authored by Tobias Steiner

Update readme
# Api
The API provides a gateway to the streaming platform. Via an HTTP interface, users can get geoconcordances from the different streaming apps.
## How it works
Unlike a classic API, this API never communicates with the different components itself, so this code will never send a query to a DB or another store. Instead, it forwards the HTTP request to a Kafka topic (*-request) and waits on another topic (*-response) for possible responses. On that topic, different resolver apps fulfill the request. If no resolver answers within a given timeframe, we respond with a 404.
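The request/reply flow above can be sketched as follows. This is a hypothetical illustration only: the real service talks to actual Kafka topics, while here the topics are simulated with in-memory queues, and the message fields (`correlation_id`, `uri`) are assumptions, not the real schema.

```python
import queue
import threading
import uuid

# In-memory stand-ins for the Kafka topics (illustration only).
request_topic = queue.Queue()   # plays the role of *-request
response_topic = queue.Queue()  # plays the role of *-response

def resolver_loop():
    """A resolver app: consumes requests and produces answers."""
    while True:
        msg = request_topic.get()
        if msg is None:  # shutdown signal for this sketch
            break
        response_topic.put({"correlation_id": msg["correlation_id"],
                            "body": f"concordances for {msg['uri']}"})

def handle_http_request(uri, timeout=2.0):
    """Forward the HTTP request to the request topic and wait for a reply.

    Returns a (status, body) pair; 404 if no resolver answers in time.
    """
    correlation_id = str(uuid.uuid4())
    request_topic.put({"correlation_id": correlation_id, "uri": uri})
    try:
        while True:
            reply = response_topic.get(timeout=timeout)
            if reply["correlation_id"] == correlation_id:
                return 200, reply["body"]
    except queue.Empty:
        # no resolver answered within the given timeframe
        return 404, None

worker = threading.Thread(target=resolver_loop, daemon=True)
worker.start()
status, body = handle_http_request("https://dodis.ch/G8")
print(status, body)  # 200 concordances for https://dodis.ch/G8
request_topic.put(None)  # stop the sketch's resolver
```

The key point is the correlation id: the API instance that forwarded the request waits only for replies carrying its own id and turns a timeout into a 404.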
## Too complicated
This is a complicated way to query a DB or an Elasticsearch index, but it gives us flexibility and loose coupling. If we want to add a new algorithm, we can do so without changing the API code. We can add a new subsystem without changing anything else. It also scales more independently: if one subsystem is too slow, we can scale just that subsystem. Future requirements will show us whether the complexity is worth the effort.
## Idempotent
This implementation conflicts with the HTTP idempotency specification, so there is no guarantee of getting the same answer for the same request. Under several circumstances we will get different answers:
* If a processor app has crashed, the answer will be different - mostly a 404.
* If there is high load on the server, the answer may be different.
* If caching of queries is successful, the answer may be different.

In general the API tries to deliver at least one valid answer.
# todo: rewrite config from here
## Endpoints
So far we offer three endpoints; more will follow.
### sameas
We offer a general endpoint `sameas` to get aggregated results over a bunch of processors. This endpoint reduces the number of requests and helps to get a compacted result.
#### Config processors
You can configure the different processors to get differentiated results. Under the hood these arguments are forwarded to the processor apps, which use them to calculate the result.
### /v1/sameas/:node
To get manually or semi-manually linked entities out of the histHub-geolinker, you can use the sameas endpoint. This endpoint offers high-quality links, mostly created by humans and stored in the geolinker. The dataset includes links from our partners and links that we generated to other services such as Geonames and Wikidata.
In the underlying datastore (Neo4j) each link is represented as an arrow between two nodes, e.g. URI1 (Project A) <-> URI2 (Project B). The API allows you to fetch just those first-level links, so you will get the links from and to your URI. You can also query the second and the third level of the network to get more links.
```bash
# this will give us a list of same links for the resource https://www.wikidata.org/wiki/Q1976179
curl -X POST http://localhost:3000/same_as/https://www.wikidata.org/wiki/Q1976179
# this will give us a list of concordances for the resource https://dodis.ch/G8
# we will get the first level of the network
curl -X GET https://api.geolinker.histhub.ch/v1/sameas/https://dodis.ch/G8
```
#### Config processors
You can configure the different processors and so get different aggregated results. Under the hood these arguments are forwarded to the processor apps, which use them. Another possibility is to use the specific endpoint of a processor app. For the configuration see the individual endpoints.
### same_as/match
This endpoint performs a search in the Elasticsearch index and compares the indexed document with URL XYZ with all the other documents. This helps to find similar places on the fly and is part of a matching algorithm.
With the parameter `depth` you can define how many layers of hops you want to traverse to aggregate your result. In the example below you will fetch all the nodes directly connected to the queried node (5 nodes), plus all the nodes connected with the results of the first query (~20 nodes).
```bash
# this will give us a list of matched links for the resource https://www.wikidata.org/wiki/Q1976179
curl -X POST http://localhost:3000/same_as/match/https://www.wikidata.org/wiki/Q1976179 \
  -d '{"distance": "15km", "fuzziness": "~2"}'
# this will give us a list of concordances for the resource https://dodis.ch/G8
# we will get the first and the second level of the network
curl -X GET https://api.geolinker.histhub.ch/v1/sameas/https://dodis.ch/G8?depth=2
```
### same_as/concordance
This endpoint performs a search in the Neo4j index and returns all documents whose graph is connected to the node with URL XYZ. We save preprocessed connections, manually generated connections and connections from the providers in this index. You can traverse the graph at different depths.
The idea behind this network approach is simple: researchers connect their entities with the same entities in the network. Whether a statement about the connection between two resources holds depends on the research question. In one context a statement is true; in another, the same statement is 100% wrong. In a database about peace treaties, the castle of Versailles is the same as the city of Versailles: they connect the treaty with the geographical point of the city of Versailles. For a research project about castles this statement is wrong, since the city and the castle are not of the same type.
The geolinker ignores those conflicting statements. It gives you the possibility to link your resource with every other resource in the network; if it makes sense for your research, that's fine. So you can query all your own links with `depth=1`. Normally you know the projects you link to well and trust them, so you can also get all their links with `depth=2`. The further you traverse the network, the more unrelated links you will get and the fuzzier the result becomes.
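How `depth` widens the result set can be sketched with a depth-limited breadth-first traversal. The link data below is made up for illustration; the real links live in Neo4j and this is not the API's actual query code.

```python
from collections import deque

# Hypothetical link network: each URI maps to the URIs it is linked with.
links = {
    "https://dodis.ch/G8": ["uriA", "uriB"],
    "uriA": ["uriC"],
    "uriB": [],
    "uriC": ["uriD"],
}

def concordances(start, depth):
    """Collect all nodes reachable from `start` within `depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = []
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                result.append(neighbor)
                frontier.append((neighbor, dist + 1))
    return result

print(concordances("https://dodis.ch/G8", depth=1))  # ['uriA', 'uriB']
print(concordances("https://dodis.ch/G8", depth=2))  # ['uriA', 'uriB', 'uriC']
```

With `depth=1` only the directly linked nodes come back; each extra level adds the links of the previous level's results.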
### v1/similarto/node
Most often you will want to query the sameas endpoint. If you can't find the resource in the sameas endpoint, you can try the similarto endpoint. While the sameas API returns stable connections that are manually or semi-automatically generated, the similarto service returns resources that are connected automatically based on various criteria of similarity. Depending on the configuration of your request you will get different results. If you query the API, you can get resources with a similar name in a specified area around the queried node.
```bash
# this will give us a list of concordances for the resource https://www.wikidata.org/wiki/Q1976179
curl -X POST http://localhost:3000/same_as/concordance/https://www.wikidata.org/wiki/Q1976179 \
  -d '{"depth": 1, "trust": ["dodis", "wikidata"]}'
# this will give us a list of automatically matched links for the resource https://dodis.ch/G300
curl -X GET https://api.geolinker.histhub.ch/v1/similarto/https://dodis.ch/G300
```
### same_as/metadata
This endpoint fetches metadata from one of the providers. It extracts Open Graph metadata and returns that data in the result. The performance of this processor is quite slow: it first waits for Kafka to return a list of possible links, and only then can it perform the request to extract the metadata.
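Extracting Open Graph metadata boils down to collecting `<meta property="og:..." content="...">` tags from the provider's page. A minimal sketch with Python's standard-library HTML parser, fed with a made-up page (not the processor's real implementation):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects og:* properties from <meta> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            self.metadata[prop] = attrs["content"]

# Hypothetical provider page for illustration.
html = """<html><head>
<meta property="og:title" content="Versailles">
<meta property="og:type" content="place">
</head><body></body></html>"""

parser = OpenGraphParser()
parser.feed(html)
print(parser.metadata)  # {'og:title': 'Versailles', 'og:type': 'place'}
```

In the real pipeline the HTML would be fetched from the linked resources returned via Kafka, which is what makes this processor slow.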
```bash
# this will give us a list of attributes from one of the resource linked with https://www.wikidata.org/wiki/Q1976179
curl -X POST http://localhost:3000/same_as/metadata/https://www.wikidata.org/wiki/Q1976179 \
  -d '{"hostnames": ["www.dodis.ch", "www.hls-dhs-dss.ch"]}'
```
You can specify the criteria for similarity via two parameters. With `distance` you can define a maximal distance (in meters) between the queried node and a possible match. Normally, the closer a similar resource is to yours, the higher the chance of similarity.
Often names are written slightly differently in two projects. With the parameter `fuzziness` you can define a [Levenshtein distance](https://de.wikipedia.org/wiki/Levenshtein-Distanz) applied to the name of the resource. For example, you can find `Hord` and `Nord` with a Levenshtein distance of one.
```bash
# this will give us a list of automatically matched links for the resource https://dodis.ch/G300,
# with a maximal distance of 20km and a fuzziness of 2 relative to the queried node
curl -X GET "https://api.geolinker.histhub.ch/v1/similarto/https://dodis.ch/G300?distance=20000&fuzziness=2"
```
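For reference, the Levenshtein distance behind `fuzziness` counts the minimal number of single-character insertions, deletions and substitutions needed to turn one name into the other. A small sketch of the classic dynamic-programming computation (an illustration, not the matcher's actual code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Hord", "Nord"))  # 1: a single substitution
```

So a query with `fuzziness=1` would treat `Hord` and `Nord` as matching names, while `fuzziness=0` would require an exact match.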
## Docker
To build the image, use the following command. The image provides an API interface for the Kafka topics. The container is based on Alpine Linux.
```bash
docker build -t source.dodis.ch:4577/histhub/api .
# Upload to the registry
docker push source.dodis.ch:4577/histhub/api
```
## CI/CD
We have a build pipeline in GitLab, so manually building the image is no longer necessary.
## Deploy to k8s
In the deployment repository you can find the configuration to start a k8s pod:
```bash
kubectl create -f api-deployment.yaml
```
## Troubleshooting
* Please [urlencode](https://de.wikipedia.org/wiki/URL-Encoding) the URL you query for; otherwise you may get a 400 error.
* If you receive a 404 all the time, one of the workers may have crashed. Please contact us and we will restart it. We are currently working on the stability of the cluster.
# Future
* We will provide a metadata endpoint to fetch structured data about the resource.
* We will create an endpoint where you can reconcile data.