
h4ck3rm1k3's Diary

Recent diary entries

WaysToRelation program:

This new tool, written in C++, makes a two-pass, high-speed, low-memory sweep of the XML using SAX.
It does not produce the relations just yet; all I have to do now is add each way that I process in the second pass to a relation. I already build up a vector of nodes, and each node is looked up in a map of nodes by coordinates and by id.
Ways are not stored for longer than it takes to process one.
The array of nodes is needed to first collect the tagged attributes.

The first pass tags each node with how many ways reference it, declares one way as the owner, and points duplicate nodes (matched by location) back to the first declaration of the node.

I am thinking about how to do this all in one pass, because it should be possible to process each way the first time through. When a way is split, the relations are created; but when the next way references one of those nodes, the relation would have to split that line again. That is why we need a second pass. Otherwise we would have to store the ways and split them in memory. I have not pursued that; you would still have to process them again, though it could take less time than two passes.

The second pass emits new ways based on contiguous blocks of nodes.
The connections to the other paths are also established.

The ways are given new ID numbers, counting down.
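
As a sketch of the bookkeeping (assumed structures and names, not the checked-in code): the first pass counts how many ways reference each node and folds duplicate locations onto the first declaration; the second pass cuts each way into contiguous blocks at the shared nodes and gives each piece a new id, counting down.

#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct NodeInfo {
    long id;          // first node id declared at this location
    int  way_count;   // how many ways reference this node
    long owner_way;   // the way declared as owner
};

typedef std::pair<double, double> Coord;    // (lat, lon)

std::map<long,  NodeInfo*> nodes_by_id;     // node id  -> info
std::map<Coord, NodeInfo*> nodes_by_coord;  // location -> first declaration

// First pass: called once per <nd ref=...> while sweeping a way.
void referenceNode(long way_id, long node_id, double lat, double lon) {
    Coord c(lat, lon);
    std::map<Coord, NodeInfo*>::iterator it = nodes_by_coord.find(c);
    NodeInfo* n;
    if (it == nodes_by_coord.end()) {
        n = new NodeInfo();
        n->id = node_id;
        n->way_count = 0;
        n->owner_way = way_id;      // the first way to touch the node owns it
        nodes_by_coord[c] = n;
    } else {
        n = it->second;             // duplicate location: reuse the first node
    }
    nodes_by_id[node_id] = n;
    n->way_count++;                 // a count > 1 marks a split point
}

void emitWay(long new_id, const std::vector<long>& nodes);  // writes a <way>

long next_new_id = -1;              // new ways get ids counting down

// Second pass: emit one new way per contiguous block, cut at shared nodes.
void splitWay(const std::vector<long>& way_nodes) {
    std::vector<long> block;
    for (std::size_t i = 0; i < way_nodes.size(); ++i) {
        long nid = way_nodes[i];
        block.push_back(nid);
        if (nodes_by_id[nid]->way_count > 1 && block.size() > 1
            && i + 1 < way_nodes.size()) {
            emitWay(next_new_id--, block);  // the block ends at the shared node
            block.clear();
            block.push_back(nid);           // the shared node starts the next one
        }
    }
    if (block.size() > 1)
        emitWay(next_new_id--, block);      // each piece then joins the relation
}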

Invoked like this:
./WaysToRelations ../nj_zip/new/tl_2009_34_zcta3.osm > tl_2009_34_zcta3.osm 2>err.txt

The file produced is uploaded to archive.org (link below).

time ./WaysToRelations ../nj_zip/new/tl_2009_34_zcta5.osm > tl_2009_34_zcta5.osm 2>err.txt

real 2m0.315s
user 0m48.235s
sys 1m11.832s

time wc ../nj_zip/new/tl_2009_34_zcta5.osm
1,249,669 lines 5,598,205 words 59,986,381 bytes ../nj_zip/new/tl_2009_34_zcta5.osm

real 0m3.638s
user 0m3.536s
sys 0m0.040s

So you can see that it uses a lot of time. Two minutes is a lot, but that is 1.2 million lines of XML processed. The resulting file:
http://www.archive.org/details/Tl_2009_34_zcta5.osmSplit
I am uploading it to archive.org.


Simple TagReport in C++

Posted by h4ck3rm1k3 on 27 December 2009 in English.

I have checked in a simple tag report to process the tags for JohnSmith;
the idea is to extract the data about the highways for signs. The routine is pretty generic, and it demonstrates how you can build simple reports in C++ with my new toolkit.

I created a TagWorld object to collect the tag data; in it I count each tag and the values of certain tags. The tags whose values to collect are stored in a third map.

So the TagProcessor is used for the ways, relations and nodes, with the same type in the callbacks. The results are incremented in the world object and reported in the end-document callback.
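
A minimal sketch of that shape (the member names here are my assumption, not the checked-in TagReport.cpp):

#include <iostream>
#include <map>
#include <string>

struct TagWorld {
    std::map<std::string, long> key_counts;   // every tag key -> count
    std::map<std::string, std::map<std::string, long> > value_counts;
    std::map<std::string, bool> collect;      // third map: keys whose values we keep

    // Called from the tag callback for nodes, ways and relations alike.
    void addTag(const std::string& k, const std::string& v) {
        key_counts[k]++;
        if (collect.count(k))
            value_counts[k][v]++;
    }

    // Called from the end-document callback.
    void report(std::ostream& os) const {
        for (std::map<std::string, long>::const_iterator it = key_counts.begin();
             it != key_counts.end(); ++it)
            os << it->first << "\t" << it->second << "\n";
    }
};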

It all works very nicely, and I hope that more people will use this toolkit.

For example, I processed the entire new_jersey dump file;
it takes: 2m45.288s

Word count takes 1m33.406s to process 18 million rows:
18,612,601 78918048 1249629139 new_jersey.osm

So this program is only about 2x slower than wc; that is pretty good.

Test it yourself:
1. bzr branch lp:~jamesmikedupont/+junk/EPANatReg
2. make
3. ./TagReport test.osm

mike

http://bazaar.launchpad.net/~jamesmikedupont/%2Bjunk/EPANatReg/annotate/head%3A/TagReport.cpp

RTree now running

Posted by h4ck3rm1k3 on 27 December 2009 in English.

I have now been able to add the R-tree to the processing,
using the one from roadnav:
URL: https://roadnav.svn.sourceforge.net/svnroot/roadnav/libsdbx/trunk

The IndexWaysWithRTree reads two files:
1. the osm file with ways, which are converted to bounding boxes and stored in the R-tree;
2. the osm file with points, which are looked up in the R-tree.
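
The R-tree itself comes from the roadnav libsdbx code linked above, so here is only a hedged sketch of the surrounding pattern, with a plain vector and a linear scan standing in for the spatial index:

#include <cstddef>
#include <iostream>
#include <vector>

struct BBox {
    double min_lat, min_lon, max_lat, max_lon;
    long way_id;
    bool contains(double lat, double lon) const {
        return lat >= min_lat && lat <= max_lat
            && lon >= min_lon && lon <= max_lon;
    }
};

int main() {
    // File 1: every way is reduced to its bounding box and indexed.
    std::vector<BBox> index;
    BBox b = { 40.6, -74.6, 40.8, -74.4, 1001 };
    index.push_back(b);

    // File 2: every point is looked up against the stored boxes. A real
    // R-tree answers this query in log time instead of scanning.
    double lat = 40.7, lon = -74.5;
    for (std::size_t i = 0; i < index.size(); ++i)
        if (index[i].contains(lat, lon))
            std::cout << "point falls in way " << index[i].way_id << "\n";
    return 0;
}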

It is running quite fast. I have reworked the processing function to take a world object reference so that the data can be reused.

Also, I have experimented with the Perl modules
Algorithm::BreakOverlappingRectangles in BreakOverlapping.pl
and
Algorithm::RectanglesContainingDot in buildindex.pl;
they both take the bounding-box text output of ExtractWays for a given osm file.

mike

I have made an update to the SAX processing of the OSM data.
Now you have a nice way to plug in your own processing, at high speed, without any funny overhead!

The OSM processor is a template function that has four template parameters:

http://bazaar.launchpad.net/~jamesmikedupont/%2Bjunk/EPANatReg/annotate/head%3A/OSMProcessor.hpp

Each of those parameters is passed to a similar templated SAX callback processor:
http://bazaar.launchpad.net/~jamesmikedupont/%2Bjunk/EPANatReg/annotate/head%3A/SAX2OsmHandlers.hpp

It creates a world object and each type of processor, passing the world to them by reference so they can access it. Then it parses the XML and calls the user-defined templates on callback.
Each one gets the chance to handle the StartElement callback from SAX and do what it wants. You don't have to do anything, and that makes it fast.

So now you can just define your own plugins as you like. If you want two plugins at once, you can define a new class that combines them, as in the sketch below. It should be very simple to make incremental and standalone processors for OSM data that are very fast.
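
A rough sketch of that shape, with assumed handler names (the real interface is in OSMProcessor.hpp above):

// The processor takes one handler type per element kind; each handler is
// constructed with a reference to the shared world object.
struct World { /* shared state filled in by the handlers */ };

struct NoOp {                        // a do-nothing plugin compiles away
    NoOp(World&) {}
    void startElement(const char*, const char**) {}
};

template <class NodeH, class WayH, class RelH, class TagH>
void processOsm(const char* filename) {
    World world;
    NodeH nodes(world);
    WayH  ways(world);
    RelH  rels(world);
    TagH  tags(world);
    (void)filename;  // ... run the SAX parser here, routing each
                     //     StartElement to nodes/ways/rels/tags ...
}

// Combining two plugins, as described above: one class that holds both
// and forwards every callback to each in turn.
template <class A, class B>
struct Both {
    A a;
    B b;
    Both(World& w) : a(w), b(w) {}
    void startElement(const char* name, const char** atts) {
        a.startElement(name, atts);
        b.startElement(name, atts);
    }
};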

Now that this framework is started, I invite you all to help out.

Mike

Very fast osm processing in C++

Posted by h4ck3rm1k3 on 24 December 2009 in English.

Hi there, I have started to process the OSM files in C++.
Here is an example of a very fast processor that looks for duplicate coords and IDs:
http://bazaar.launchpad.net/~jamesmikedupont/%2Bjunk/EPANatReg/annotate/head%3A/CheckDuplicates.cpp
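
The idea, sketched (assumed structures, not CheckDuplicates.cpp itself): keep one map keyed by node id and one keyed by coordinates, and report a second hit in either as a duplicate.

#include <iostream>
#include <map>
#include <utility>

typedef std::pair<double, double> Coord;

std::map<long, Coord> seen_ids;     // node id  -> first location
std::map<Coord, long> seen_coords;  // location -> first node id

// Called once per <node> element from the SAX handler.
void checkNode(long id, double lat, double lon) {
    Coord c(lat, lon);
    if (seen_ids.count(id))
        std::cerr << "duplicate id " << id << "\n";
    else
        seen_ids[id] = c;
    if (seen_coords.count(c))
        std::cerr << "node " << id << " duplicates coords of node "
                  << seen_coords[c] << "\n";
    else
        seen_coords[c] = id;
}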

For the new_jersey_admin.osm from cloudmade, it finishes in about 5 seconds:
time ./a.out ../new_jersey_admin.osm 2> report.txt
real 0m5.127s

For comparison, wc takes about 2.2 seconds (393,773 lines):
time wc ../new_jersey_admin.osm
393773 1974640 30893709 ../new_jersey_admin.osm
real 0m2.196s

And with xmllint:
time xmllint ../new_jersey_admin.osm > lint.osm
real 0m3.192s

I have started on the algorithm for processing ways and creating relations, but will be writing it in C++ instead of Perl.
Here is the start; I have only tested it on two counties so far:

http://bazaar.launchpad.net/~jamesmikedupont/%2Bjunk/EPANatReg/annotate/head%3A/WayToRelation.pl

It just looks for runs of points that are shared. The first version was brute force, comparing all the points. This one scans ascending in one way and descending in the other. I did not test on many counties, because I would like to do this with a large osm file in C++.
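
The core of the match, sketched in C++ for consistency (my reading of the rule, not a port of the script): walk one way forward and the other backward, extending the run while consecutive node ids agree.

#include <cstddef>
#include <vector>

// Length of the run shared by a (going forward from i) and b (going
// backward from j); adjacent boundary ways usually traverse their common
// border in opposite directions, hence the opposite scan directions.
std::size_t runLength(const std::vector<long>& a, std::size_t i,
                      const std::vector<long>& b, std::size_t j) {
    std::size_t len = 0;
    while (i + len < a.size() && j >= len && a[i + len] == b[j - len])
        ++len;
    return len;
}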

In Perl, with the SAX parser, it takes about a minute to process the file:
time perl testsax.pl ../new_jersey_admin.osm 2> report.txt
real 1m1.225s

http://bazaar.launchpad.net/~jamesmikedupont/%2Bjunk/EPANatReg/annotate/head%3A/testsax.pl

Merry Christmas,
mike

Just woke up, been dreaming about OSM.

1. For the ZCTAs (zip code tabulation areas), instead of adding the areas to OSM,
process the streets and mark them as follows:

1.a tag "is_in:zcta:2009"=012345 for the streets completely inside a zcta.

1.b two tags, left and right, "boundary_(left|right):zcta:2009"=01234; add a new node at any boundary point where the street intersects a zcta.

1.c tag "FIXME:zcta:2009" for any points that conflict with existing zipcode data.
Here are some conflicts:
1.c.1: The existing tiger zipcode/post_code attribute does not match the ZCTA.
1.c.2: The street is not inside any ZCTA.
1.c.3: The street needs to be split up into parts between multiple ZCTAs, creating relations instead.

So that would just mark up the existing streets with the zcta data and allow them to be processed later, without causing any new data to be displayed on the map.

Creating zcta relations using the existing streets is the next step:
1.d the tag "border_left:zcta:2009"=01234 for any street that runs along the border of a zcta on its left-hand side, and analogously for the right-hand side.
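
A hedged sketch of the decision this scheme implies; the names and the simplifications here are mine, not a spec:

#include <string>

enum Containment { INSIDE, CROSSES, OUTSIDE_ALL };

// Pick the tag to emit for one street against one zcta polygon.
std::string zctaTag(Containment c, const std::string& zcta,
                    bool matches_existing_postcode) {
    if (!matches_existing_postcode)
        return "FIXME:zcta:2009=" + zcta;    // 1.c.1: conflicts with tiger
    switch (c) {
    case INSIDE:  return "is_in:zcta:2009=" + zcta;          // 1.a
    case CROSSES: return "boundary_left:zcta:2009=" + zcta;  // 1.b (+ _right)
    default:      return "FIXME:zcta:2009=none";             // 1.c.2
    }
}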

So much for marking up the existing streets.

The other thing will be to render each zcta as a tile, or a set of tiles, using the bounding box. The tiles can be hosted statically on archive.org. We can also use OpenLayers to pull in more data from OSM.

The static html pages hosted on archive.org can be enhanced so that users can provide feedback via email. We can include OpenLayers as well to allow for dynamic behaviour, including the latest OSM data. If we do it right, we can also replace old data on the fly using js.
Each page will then have a way to report problems: the user can select a street or point and say "this belongs to zipcode X", then submit it via email to a designated address to be processed.


Location: Longport, Atlantic County, New Jersey, 08403, United States

Here, now, we have the start of a tool to extract the admin borders into polys from the cloudmade admin files. You can tweak it yourself to emit any way as a polygon.

https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg

The Osm2Poly tool can now work with the NJ county borders from cloudmade.
First you extract all the parts into an output dir,
and then use the merger to create a file with all the subpolys.

# first we process the admin border; all nodes that are not known will be reported so you can fetch them
perl osm2poly.pl ../new_jersey_admin.osm

# then we run the get routine to pull all the missing nodes
bash get.sh

# then we merge all the osm files into one nice one
cat fetch_*.osm ../new_jersey_admin.osm | perl osmsimplemerge.pl > new.osm

# then we convert them to poly files; only things that are named will get processed, and any tags not known will be emitted as regexes for you to paste into the script, e.g. "elsif ($1 eq 'is_in:state_code'){}"

If you want to emit the node, then fill out the "$poly_file=$2; $poly_id=$2;" part.

perl osm2poly.pl new.osm

# the new version of the merger tool takes a name, not only for zipcodes
perl polymerger.pl NJADMIN

# then we create osm and bb files from all the polys
for x in merge/*.poly; do echo $x; perl poly2osm.pl $x> $x.osm; done
for x in merge/*.poly; do echo $x; perl poly2bb.pl $x> $x.bb; done

I am uploading the results now; they will be published here:
http://www.archive.org/details/NJ_Counties

Some of the counties are not complete; they do not contain all the parts, like
osm.org/browse/way/32209341 (Rockland County)...

Location: Upper Freehold Township, Monmouth County, New Jersey, United States

Polygon files for NJ ZCTA on the way

Posted by h4ck3rm1k3 on 22 December 2009 in English.

I now have polygon files for the 5-digit zip code tabulation areas
and am uploading them to archive.org.

http://fmtyewtk.blogspot.com/2009/12/shapfiles-for-nj-zcta-zip-code.html

The nice thing about these files is that they work just fine for the zip codes that I have looked at (Long Hill Township, 07933).

Using these polygons and the bounding boxes, we can split up the OSM file and
also check the tiger import. I found a lot of missing zipcodes and even some that were wrong in the tiger import.

Next I will be creating OSM files from the NJ osm data from cloudmade, splitting the dump up by zipcode.

I will also be producing county polygons.

The code is now checked in, yes it is a hack, but it is a hack of a hack.

mike

Hi,
Today I have patched two tools:
1. Osm2Poly so that it processes the results of the Shp2Osm from the census data.
http://fmtyewtk.blogspot.com/2009/12/using-osmosis-to-split-zipcode-files.html

2. Osm2PgSql so that it processes the results of the Shp2Osm from the census data.
http://fmtyewtk.blogspot.com/2009/12/osm2pgsql-hack-for-importing-id-ways.html

And as a result, I can now split osm data by zipcode.
As the next step, I need to check the zipcodes with multiple ranges.
Then I will upload the results for NJ to the archive.org server.

mike

Location: -74.531, 40.666

New Host for OSM data, archive.org

Posted by h4ck3rm1k3 on 20 December 2009 in English.

I have been looking for a place to share OSM layer files, and have found a perfect spot for them : Archive.org

Archive.org supports unlimited size files and allows you to even upload videos. They support all creative commons licensed data.

I have started to upload my OSM conversion of the US Census Bureau zipcode database. Anyone who wants the data can get it, and those who don't want it don't need to see it. The only problem is that people won't find this data when they are looking for it. Hopefully they will find this post, but otherwise that is the smallest problem to start with.
http://www.archive.org/details/OpenstreetmapZipcodes

I will be using the zipcodes to split up the EPA files so that we have a handy mechanism to select which files you are interested in. But first, I am downloading all of the individual data records and processing them into something usable. I would like to use that data to double-check the containing zipcode and county data and to flag all problems.

That way OSM will not be flooded with "junk" data, yet we will have a place to share this data with each other.

The deletion of the EPA data is finished; sorry about the problems it caused.

In the end, we should be able to extract all the zipcode regions and all the things they contain, so that we would have a tool to find which streets belong to which zipcode.

With that data we would have a great tool for processing and verifying address data, and OSM would become even more valuable.

mike

removing the EPA data from OSM

Posted by h4ck3rm1k3 on 18 December 2009 in English.

Due to popular request, I am removing the EPA changesets from OSM.

That does not mean I have given up, but it does mean I need to rethink how this data should be used.

I have never dealt with such a huge and complex set of map data before, and it requires more thought.

At the moment I am working on downloading the record data from the EPA; I have created a parser to convert that HTML into a database.

I am *not* going to upload these 100k points into OSM again. But I do think that people will be interested in the POIs going forward, and that if we can find a way to crowdsource the checking of this data, it will be a great benefit to people.

mike

I have been looking for the shape files for Bull Shoals Lake and Norfork Lake in Arkansas.

http://water.usgs.gov/GIS/dsdl/ds240/index.html

It is mentioned here:
http://msdis.missouri.edu/datasearch/metadata/utm/st_lake_cls_use.xml
"The following water bodies were included in the NHD and staff edited the polygons with GIS software to more accurately depict the water body."
7317 Norfork Lake
7315 Bull Shoals Lake

I have used the shp2osm converter here:
http://svn.openstreetmap.org/applications/utils/import/shp2osm/shp2osm.pl

And I have written a filter to extract only water out of those maps :
http://bazaar.launchpad.net/%7Ekosova/%2Bjunk/openstreetmapkosova/revision/100

The import looks good, and I am running it only on Bull Shoals at the moment, using a coastline relationship.

If there is interest, we can import more.
osm.org/browse/changeset/3355937

Location: Norwalk, Stone County, Missouri, 65747, United States

open letter to the EPA

Posted by h4ck3rm1k3 on 12 December 2009 in English.

Submitted to:
http://iaspub.epa.gov/enviro/ets_grab_error.smart_form

Hello,
I have imported the data from the KML file http://www.epa.gov/enviro/geo_data.html into openstreetmap, for example this changeset: osm.org/browse/changeset/3350800. But many of the items are not even there any more.

In order to determine more information, I would have to run a massive query on your database. Could you please help me out and provide information on which sites are worth putting on the map, such as whether the factory is even still there?

thanks,
mike

Next Project for the EPA and Mine data

Posted by h4ck3rm1k3 on 11 December 2009 in English.

I would like to suggest a next step for the mine data:
to drill into the webpage and extract the commodity being mined.

For example:
http://tin.er.usgs.gov/mineplant/show.php?labno=5927
Site name: Murray Mine
Company name: Murray Mines, Inc.
Mine or Plant: M/P
Commodity: Sand and Gravel

This could be parsed and added into OSM via
ref=*, operator=* and resource=*.
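
A small sketch of that mapping (the field names are hypothetical; name=* for the site name is my addition):

#include <iostream>
#include <string>

struct MineRecord {            // fields scraped from the show.php page
    std::string labno;         // e.g. "5927" from the URL
    std::string site_name;     // "Murray Mine"
    std::string company;       // "Murray Mines, Inc."
    std::string commodity;     // "Sand and Gravel"
};

// Emit the suggested OSM tags for one record.
void emitTags(const MineRecord& m, std::ostream& os) {
    os << "ref="      << m.labno     << "\n"
       << "name="     << m.site_name << "\n"
       << "operator=" << m.company   << "\n"
       << "resource=" << m.commodity << "\n";
}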

Thanks to goldfndr__ for his advice.

Also, for the EPA hazards, there is this data:

http://iaspub.epa.gov/enviro/national_kml.registry_html?p_registry_id=110000801222

The data about each hazard is, for example:
FACILITY PROGRAM INTERESTS
PROGRAM INTEREST: RESOURCE CONSERVATION AND RECOVERY ACT INFORMATION SYSTEM, HAZARDOUS WASTE LARGE QUANTITY GENERATOR
PROGRAM ID: NJ0000061846

But about each site there is lots of data:
http://iaspub.epa.gov/enviro/fii_query_detail.disp_program_facility?p_registry_id=110000801222

http://www.epa.gov/osw/hazard/generation/lqg.htm


EPA Bulk Import

Posted by h4ck3rm1k3 on 11 December 2009 in English.

Hi all,

I have now imported about 100k nodes that describe factories and buildings that are on the EPA watchlist.

http://www.epa.gov/enviro/geo_data.html

This import is not perfect: the names are in CAPS and the symbols are not perfect.
Of course it could be better, but the information is there and the locations are from the EPA. This information is important, for example when purchasing a property or estimating its value, because the environmental impact of an industry might have a negative effect on a property even long after it has changed its name.

The same applies to bomb impact sites from depleted uranium and other nice things like that.

My vision is that we should have these factors in OSM to later help with the pricing of properties.

People should be able to use OSM to be able to choose a safe place to live or to evaluate where to buy a house.

Now, with the mines, quarries and EPA sites imported, you will have a better idea of what is around you.

All the best,
Mike

Hi all,

I have tested my renaming script for openstreetmap to use the Albanian
names from the global database of names.

The script runs like this:

1. mkdir cleanup
2. mkdir cleanup/parts
3. perl ./mergenames3.pl kv.txt parts/part_115.osm

where parts/part_115.osm is downloaded from osm.
I have created a split routine that creates a split-up of Kosovo so you
can process it piece by piece.

To create the splits, you run this script:
1. mkdir parts
2. bash KosovoSplitterGet.sh

But you should first delete any parts that you already have.

The kv.txt is from the global naming system and is checked in.

You can see the results here:
osm.org/browse/changeset/3350781

Now, I am not going to process all of Kosovo. If someone wants to help
rename the openstreetmap objects, just follow these instructions. If
they are not clear enough, let me know and I will explain them more.

You can find all of the code in bzr:
bzr branch lp:~kosova/+junk/openstreetmapkosova

Also, I know these scripts are not very clean and need to be
rewritten, in python or even in perl. All help is welcome.

mike