
Wikipedia extract

Posted by h4ck3rm1k3 on 23 July 2009 in English.

So,
I found out yesterday that I don't need to parse the whole of Wikipedia to get the points:
http://www.webkuehn.de/hobbys/wikipedia/geokoordinaten/Wikipedia_en_2008-03-12.zip

So now I have split that file by the area of interest.
First I parsed my Kosovo boundaries file and extracted the min/max of lat and lon:

http://bazaar.launchpad.net/%7Ejamesmikedupont/%2Bjunk/openstreetmap-wikipedia/revision/2/extract.pl#extract.pl

using openstreetmapkosova/kosovaadmin.osm from my OSM branch, also on Launchpad.

This produced:
LAT Avg 42.3764065194805
cnt 539
Min 41.8534278
Max 43.2723636
size 1.4189358

LON Avg 20.9177833755102
cnt 539
Min 20.0722732
Max 21.8005791
size 1.7283059
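The pass that extract.pl makes over the boundary file could be sketched in Python like this (a rough equivalent, not the actual script; the field names in the returned dicts are my own):

```python
import xml.etree.ElementTree as ET

def bounding_box(osm_path):
    """Scan every <node> in an .osm file and return lat/lon stats."""
    lats, lons = [], []
    for _, elem in ET.iterparse(osm_path):
        if elem.tag == "node":
            lats.append(float(elem.get("lat")))
            lons.append(float(elem.get("lon")))
            elem.clear()  # keep memory flat on large files

    def stats(values):
        return {
            "avg": sum(values) / len(values),
            "cnt": len(values),
            "min": min(values),
            "max": max(values),
            "size": max(values) - min(values),
        }

    return stats(lats), stats(lons)
```

iterparse streams the file instead of loading it all at once, which matters for bigger boundary extracts.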

I adjusted those coordinates and used my stripkml script to extract all the points inside that box:
http://bazaar.launchpad.net/%7Ejamesmikedupont/%2Bjunk/openstreetmap-wikipedia/revision/2/stripkml.pl
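The filtering that stripkml does amounts to keeping only the placemarks whose point falls inside the box. A Python sketch (assuming the standard KML 2.2 namespace; the actual Wikipedia dump may differ, and KML stores coordinates as lon,lat order):

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

def placemarks_in_box(kml_path, min_lat, max_lat, min_lon, max_lon):
    """Yield (name, lat, lon) for each Placemark inside the box."""
    tree = ET.parse(kml_path)
    for pm in tree.iter(KML_NS + "Placemark"):
        name = pm.findtext(KML_NS + "name", default="")
        coords = pm.findtext(".//" + KML_NS + "coordinates")
        if coords is None:
            continue
        # KML coordinates are "lon,lat[,alt]"
        lon, lat = (float(x) for x in coords.strip().split(",")[:2])
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            yield name, lat, lon
```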

Then I used kml2osm from here
http://osmlib.rubyforge.org/ http://rubyforge.org/projects/osmlib/

The modified version handles the Wikipedia points only.
You need to run it against the existing osmlib; I run it in the examples directory.
http://bazaar.launchpad.net/%7Ejamesmikedupont/%2Bjunk/openstreetmap-wikipedia/revision/4/kml2osm
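The core of that KML-to-OSM step is just turning each point into an OSM node. A minimal Python sketch of the idea (the real kml2osm is Ruby on top of osmlib; the tags and the negative-id convention for new objects are my assumptions here):

```python
import xml.etree.ElementTree as ET

def points_to_osm(points):
    """Build an OSM XML document from (name, lat, lon) tuples.

    New objects get negative ids, the usual convention for
    not-yet-uploaded data in editors like JOSM.
    """
    osm = ET.Element("osm", version="0.6", generator="wp-sketch")
    for i, (name, lat, lon) in enumerate(points, start=1):
        node = ET.SubElement(osm, "node", id=str(-i),
                             lat=f"{lat:.7f}", lon=f"{lon:.7f}")
        ET.SubElement(node, "tag", k="name", v=name)
        ET.SubElement(node, "tag", k="source", v="wikipedia")
    return ET.tostring(osm, encoding="unicode")
```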

Here is the result:
http://bazaar.launchpad.net/%7Ejamesmikedupont/%2Bjunk/openstreetmap-wikipedia/revision/4/KosovoWP.osm

Here is my changeset, uploaded with JOSM.
osm.org/browse/changeset/1911214

Now I would like a way to process the Wikipedia
data in chunks like this. There should be a way to extract just parts of the zip or bz2 file.
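For the bz2 case at least, the dump can be scanned without extracting it first, by streaming the decompressor. A Python sketch (the file name and search string are just examples):

```python
import bz2

def grep_bz2(path, needle):
    """Stream a .bz2 file line by line, yielding lines containing needle.

    bz2.open decompresses incrementally, so an arbitrarily large dump
    can be scanned without ever writing the extracted file to disk.
    """
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if needle in line:
                yield line
```

The same pattern would let a bounding-box test replace the substring test, so only the lines for one region ever get materialised.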

Thanks,
mike


Discussion

Comment from lyx on 23 July 2009 at 07:23

Nicely done. I noticed points labeled "list of tripoints" and "extreme points of montenegro" in your changeset, which suggest these were originally more than one point each. Might be worth checking again.

Comment from TomH on 23 July 2009 at 08:13

Importing POIs from Wikipedia is generally considered a bad idea because of the way much of Wikipedia's geodata has been sourced: some of it, at least, is believed to have been derived from sources like Google Maps. See this recent mailing list thread for a discussion of the issue:

http://www.nabble.com/Wikipedia-POI-import--to23392791.html#a23394016

Comment from Pieren on 23 July 2009 at 09:33

Yes, this is really a bad idea. Much of the Wikipedia geodata is imported from sources that do not allow commercial reuse, or from Google Maps.

Comment from drlizau on 23 July 2009 at 10:52

I've had private conversations with Mike.
Just to make it clear, this is not for import into the main OSM map; this is for informing mappers in Kosovo what may exist on the ground. If you checked the Google map of Prishtina you would see that it is hopeless, so don't imagine anything will be copied from there.

Comment from h4ck3rm1k3 on 23 July 2009 at 10:57

Well,
you can see that I am done.
If you think any of these 10 points have been copied from somewhere illegally, I will remove them.
Also, the Wikipedia people told me that facts are not copyrightable.
Yes, I used a bounding box and not the polygon. I could use osmosis to filter this dataset again with the boundary.
The result of this exercise is that there is hardly any data for Kosovo in Wikipedia.
thanks,
mike

Comment from h4ck3rm1k3 on 23 July 2009 at 10:58

And you don't have to approve this changeset, btw.
It looks like there are only two or three interesting points.

Comment from HannesHH on 24 July 2009 at 14:51

Facts not, data yes. OSM is just factual data too. ;)

Comment from Richard on 25 July 2009 at 16:34

"the wikipedia people said to me that facts are not copyrightable"

The wikipedia people are not renowned for their understanding of the complexities of geodata law. ;)
