
Check if POI website is active

Posted by Cascafico on 23 June 2023 in Italian (Italiano).

Website URLs are in many cases not rock solid, hence monitoring their status can be worthwhile. Here is how to do it in a semi-automatic fashion:
* download POIs,
* ask their URLs for a reply,
* store the OSM objects whose websites are unresponsive.

Let’s start by gathering a list of shops with a website tag. This Overpass example query yields a CSV with the essential data separated by commas. You can see the result in the Overpass data window.
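
For reference, such a query could look roughly like the following (a minimal sketch, not necessarily the exact query linked above; the [shop] filter and the overpass-turbo {{bbox}} placeholder are assumptions, to be adapted to your own area of interest):


[out:csv(::id, ::type, name, website; true; ",")];
nwr[shop][website]({{bbox}});
out;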

To automate the process (bash), we need the Overpass query string to provide as an argument to the wget command. In “Export”, simply copy the link you find under “raw data directly from Overpass API”, then (remembering to enclose the link in double quotes):


$ wget -O mylist.csv "http://overpass-api.de/api/interpreter?data=%5Bout%3Acsvblablablabla"
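
Alternatively (just a sketch, assuming the plain-text query is saved in a file called query.overpassql, a name chosen here for illustration), curl can URL-encode the query for you, so there is no need to copy the encoded export link:


$ curl -o mylist.csv --data-urlencode "data@query.overpassql" "http://overpass-api.de/api/interpreter"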

At this point, mylist.csv contains something like:


@id,@type,name,website
194581793,node,Sirene Blu,http://www.sireneblu.it/   
228109189,node,Ecoscaligera,http://www.ecoscaligera.com/   
[ETC, ETC]   

Now we need to scan each line of mylist.csv and wait for an HTTP reply (e.g. 200 OK, 301 Moved Permanently, etc.). This is done by running the following code:


#! /bin/bash
# Read the CSV (skipping the header line) and print each OSM object URL
# together with the status line of the HTTP reply from its website.
while IFS="," read -r OSMid OSMtype OSMname url
do
  REPLY=$(curl --silent --head "$url" | awk '/^HTTP/{print}')
  echo "https://www.openstreetmap.org/$OSMtype/$OSMid,$REPLY"
done < <(tail -n +2 mylist.csv)
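
A few optional curl flags can make the check more robust (just a sketch, not part of the original script): --location follows redirects so the final status is recorded, --max-time avoids hanging on dead hosts, and -A sends a browser-like user agent, which some sites require before answering:


  REPLY=$(curl --silent --head --location --max-time 10 -A "Mozilla/5.0" "$url" \
          | awk '/^HTTP/{line=$0} END{print line}')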

Let’s call the above script replies.sh. The output could be something like:


$ ./replies.sh 
https://www.openstreetmap.org/node/287058106,HTTP/1.1 301 Moved Permanently
https://www.openstreetmap.org/node/424738144,HTTP/1.1 301 Moved Permanently
https://www.openstreetmap.org/node/534834927,HTTP/2 301 
https://www.openstreetmap.org/node/766863973,HTTP/1.1 200 OK
[ETC, ETC]

Redirecting to a file, such output can easily be filtered with grep in order to obtain a list of OSM objects whose website tag needs to be updated (or removed):


$ ./replies.sh | grep  " 403 " > shops-to-update

tags: linux, bash, URL


Discussion

Comment from bryceco on 24 June 2023 at 17:22

So nice to see the fluent use of bash and the unix toolchain for this. It feels like it’s a dying art.

There is a QA tool (I unfortunately don’t recall which one) that does something similar, but also ensures that the POI name appears somewhere on the first page of the site. With a slightly longer awk script you could do the same.

Comment from Marcos Dione on 26 June 2023 at 09:57

I don’t understand that last line:

$ ./replies.sh | grep " 403 " > shops-to-update

That would mark sites with (auth) errors as up-to-date?

The results can also be misleading. Old sites could have been bought by a DNS provider, which usually puts up a selling page for which you get a 200 or one of the 300s. The tool that @bryceco mentions seems more accurate. It can also be done with bash, maybe with elinks for rendering the page into a text-only file.

Comment from Cascafico on 27 June 2023 at 21:01

I wrote this script because someone on the Italian OSM Telegram channel said that a certain QA tool had been discontinued. But no one remembered its name…

About the 403, my mistake… I should keep 500-503 instead. Besides, someone just suggested that I add a Mozilla-family user agent (-A in the curl command) to avoid false positives.

Comment from Strubbl on 29 June 2023 at 10:00

I’ve written some software for that purpose, where you can give a relation ID and check all websites within that border. E.g. the analysis for Munich looks like this: https://osm.strubbl.de/olv/table-3600062428.html

Source is available here: https://codeberg.org/strubbl/osm-link-validator

Comment from Minh Nguyen on 9 July 2023 at 16:31

Crossposted to this forum thread about solutions for avoiding link rot.
