
Recently someone asked how to find OSM users who’ve left a changeset comment but have never made an edit themselves. (Technically, the initial challenge was for a one-line bash script 😉.)

Here’s how to do it.

In OpenStreetMap, people can change their username, but OSM data provides an unchanging numeric user id (uid) for users, which we use here.

First, download the changeset discussions dump file from the OpenStreetMap data download service (planet.osm.org)⁽¹⁾.

aria2c --seed-time=0 https://planet.openstreetmap.org/planet/discussions-latest.osm.bz2.torrent

This will download discussions-YYMMDD.osm.bz2⁽²⁾, which is currently about 5 GiB.

I had to write a new tool, anglosaxon, to easily parse large XML files like this into a TSV file format⁽³⁾. This programme works on all XML files, so maybe it’s useful for other problems you might have. Install that first.

bzcat discussions-220110.osm.bz2 \
  | anglosaxon \
   -S -o changeset_id --tab -o changeset_uid --tab -o comment_uid --nl \
   -s comment -v ../../id --tab -V ../../uid NO_CHANGESET_UID --tab -V uid NO_COMMENT_UID  --nl \
   | gzip > changeset-comments.tsv.gz

This took about 45 minutes to run on my machine, and the output is about 4 MiB (19 MiB uncompressed), and has about 805,000 lines. This step takes the longest.
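Since that step is slow, a quick sanity check of the output can save a re-run later. This is only a minimal sketch: it runs on a tiny made-up stand-in file (the `sample.tsv.gz` name and its rows are invented for illustration), and counts the lines and any rows that don't have exactly three tab-separated fields.

```shell
workdir=$(mktemp -d)
# Made-up stand-in for changeset-comments.tsv.gz: a header line plus two rows.
printf 'changeset_id\tchangeset_uid\tcomment_uid\n100\t1\t2\n101\t3\t2\n' \
  | gzip > "$workdir/sample.tsv.gz"

# Count all lines, and lines that do not have exactly three tab-separated fields.
total=$(gzip -dc "$workdir/sample.tsv.gz" | awk 'END { print NR }')
malformed=$(gzip -dc "$workdir/sample.tsv.gz" | awk -F'\t' 'NF != 3 { n++ } END { print n + 0 }')
echo "lines: $total, malformed: $malformed"
```

For the real file, point the two `gzip -dc` commands at changeset-comments.tsv.gz instead.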

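We create the list of all uids who have opened a changeset. (The TSV has one row per comment, so this only covers changesets that have received at least one comment.)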

zcat changeset-comments.tsv.gz | cut -f2 | uniq | sort | uniq > changeset-uids.tsv

Then a list of all uids who have left a changeset comment:

zcat changeset-comments.tsv.gz | cut -f3 | uniq | sort | uniq > comment-uids.tsv
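One thing to watch: the header line written by `-S` passes through `cut` too, so `changeset_uid` and `comment_uid` end up in the lists as if they were uids. Here is a sketch of the same two steps that skips the header with `tail -n +2` and uses `sort -u` in place of `uniq | sort | uniq`. It runs on a tiny made-up sample (the rows and the `sample.tsv` name are invented for illustration), not the real data.

```shell
workdir=$(mktemp -d)
# Tiny made-up sample in the same three-column layout (header + 4 rows).
printf 'changeset_id\tchangeset_uid\tcomment_uid\n' >  "$workdir/sample.tsv"
printf '100\t1\t2\n100\t1\t1\n101\t3\t2\n102\t1\t9\n' >> "$workdir/sample.tsv"

# tail -n +2 drops the header line; sort -u sorts and de-duplicates in one step.
tail -n +2 "$workdir/sample.tsv" | cut -f2 | sort -u > "$workdir/changeset-uids.tsv"
tail -n +2 "$workdir/sample.tsv" | cut -f3 | sort -u > "$workdir/comment-uids.tsv"

cat "$workdir/changeset-uids.tsv"   # uids 1 and 3
cat "$workdir/comment-uids.tsv"     # uids 1, 2 and 9
```

With the real file, the pipeline would start with `zcat changeset-comments.tsv.gz | tail -n +2 | cut -f2 | sort -u`.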

Then we compare the two lists, to find the uids that are in the comment list but not in the changeset list.

comm -13 changeset-uids.tsv comment-uids.tsv | sort -n > uids-comment-without-changeset.tsv
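If `comm -13` is unfamiliar, here is a small worked example on two made-up uid lists (the values are invented for illustration). `comm` expects both inputs in plain lexical `sort` order, which is why the numeric `sort -n` only comes after it.

```shell
workdir=$(mktemp -d)
# Hypothetical uid lists; sort (without -n) gives the lexical order comm expects.
printf '1\n3\n10\n'    | sort > "$workdir/changeset-uids.tsv"
printf '1\n2\n9\n10\n' | sort > "$workdir/comment-uids.tsv"

# -1 hides lines found only in the first file, -3 hides lines found in both,
# leaving lines found only in the second file (commenters with no changeset).
comm -13 "$workdir/changeset-uids.tsv" "$workdir/comment-uids.tsv" \
  | sort -n > "$workdir/result.tsv"
cat "$workdir/result.tsv"   # prints 2 and 9, one per line
```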

Et voilà! Sin é! And there’s your results. 🙂 The file is 29 KiB, and has ~3,500 entries. I’m surprised it’s so high.⁽⁴⁾

You can find all the changesets that a uid has commented on with this command (replace UID with the uid):

zcat changeset-comments.tsv.gz | grep -P "\tUID$" | cut -f1

e.g. the changesets that uid 23770⁽⁵⁾ has commented on:

zcat changeset-comments.tsv.gz | grep -P "\t23770$" | cut -f1
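`grep -P` is a GNU grep extension and isn't available everywhere. A possible alternative, sketched here with awk, compares the third column directly. The sample rows below are made up for illustration; one of them has 23770 as the changeset author rather than the commenter, and one has a longer uid that merely ends in 23770.

```shell
workdir=$(mktemp -d)
# Made-up rows: changeset_id, changeset_uid, comment_uid.
printf '100\t1\t23770\n101\t23770\t5\n102\t7\t123770\n103\t7\t23770\n' > "$workdir/sample.tsv"

uid=23770
# Match only rows whose third field equals the uid exactly.
awk -F'\t' -v uid="$uid" '$3 == uid { print $1 }' "$workdir/sample.tsv" > "$workdir/commented.txt"
cat "$workdir/commented.txt"   # prints 100 and 103, one per line
```

With the real file this would be `zcat changeset-comments.tsv.gz | awk -F'\t' -v uid=23770 '$3 == uid { print $1 }'`.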

The OSM API has several methods to get details on an OSM user, e.g.:

curl "https://api.openstreetmap.org/api/0.6/user/23770"
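For more than a handful of uids, the API also has a call that takes a comma-separated list, `/api/0.6/users?users=…`. The sketch below only builds the URL from a uid file and doesn't make any request. The uid values are made up, and the batch size of 50 is an arbitrary assumption rather than a documented limit.

```shell
workdir=$(mktemp -d)
# Stand-in for uids-comment-without-changeset.tsv, one uid per line (made-up values).
printf '2\n9\n23770\n' > "$workdir/uids.tsv"

# Join the first 50 uids with commas; 50 is an arbitrary batch size chosen here.
ids=$(head -n 50 "$workdir/uids.tsv" | paste -sd, -)
url="https://api.openstreetmap.org/api/0.6/users?users=$ids"
echo "$url"
# To actually fetch the details: curl "$url"
```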

⁽¹⁾ Here we use aria2c, which will do a regular web/HTTP download and also use decentralized BitTorrent P2P downloads at the same time. --seed-time=0 stops aria2c when the file is fully downloaded, rather than sharing/seeding the file forever over BitTorrent.

⁽²⁾ YYMMDD is the year, month & day that the data was created

⁽³⁾ tab separated values, like CSV, but with tabs

⁽⁴⁾ If you’re curious about how to do that on one line using bash(1)’s Process Substitution:

comm -13 <(zcat changeset-comments.tsv.gz | cut -f2 | uniq | sort | uniq) <(zcat changeset-comments.tsv.gz | cut -f3 | uniq | sort | uniq) | sort -n > uids-comment-without-changeset.tsv

⁽⁵⁾ My user id


Discussion

Comment from Eiim on 22 January 2022 at 21:54

@Marketplace NFT commenting here is quite ironic. Of course, this is a diary comment, not a changeset comment, but it’s a similar idea. Hopefully the recent wave of diary spam can be cracked down on.

Comment from TheSwavu on 23 January 2022 at 02:53

Nice. I always like me some Bash magic…

I’ll have to have a look at anglosaxon. I have some Python code I use to parse the changeset dump and I’ll be interested to see how much faster your program goes.

Comment from amapanda ᚛ᚐᚋᚐᚅᚇᚐ᚜ 🏳️‍⚧️ on 23 January 2022 at 11:45

@TheSwavu: IME one of the biggest time consuming tasks is decompressing the bz2 input file. I didn’t write it here, but I’m pretty sure you can get faster performance with cat whatever.osm.bz2 | pbzip2 -d -c - | … if you install pbzip2, which is obv. in debian etc.

I’m pretty sure there’s many ways anglosaxon could be sped up, I haven’t properly investigated speeding it up yet.

Comment from bryceco on 23 January 2022 at 18:36

I didn’t know about pbzip2, which I agree might be a big time saver. But the documentation says the input needs to have been compressed with pbzip2 to yield a speed up during decompression. Is that the case for the planet file dumps?

Comment from amapanda ᚛ᚐᚋᚐᚅᚇᚐ᚜ 🏳️‍⚧️ on 23 January 2022 at 18:40

The planet osm.bz2 files? I don’t know… try it out and see?
