Cleaning up road names in China
Posted by sykul on 14 June 2019 in English. Last updated on 27 July 2020.
In China, names should almost always be in Chinese. There are some exceptions, such as certain shop names, but road names should always be in Chinese. I kept seeing the following problems crop up:
1. Abbreviated road names like “name:en=W. Shuncheng Str.”.
2. Mixed Chinese and English/pinyin like “name=文艺路 Wenyi Road”.
3. English road names using ‘Lu’ instead of ‘Road’ (Lu is the diacritic-less pinyin for ‘road’ in Chinese, often used by English speakers when naming Chinese roads even in English).
4. English names in the name= tag rather than the name:en= tag.
Problems 2 and 4 can be found with a single query. I initially tried to work out how to get Overpass Turbo to find a mix of hanzi and Latin letters, but then I realised it was much simpler than that: no valid Chinese road name should contain Latin letters, so I only had to search name= for Latin letters. The query is as follows:
[out:json][timeout:25]; way["highway"]["name"~"[A-Za-z]"]({{bbox}}); out center;
Actually, it turns out there may be some valid road names with Latin letters in them, namely alphanumeric codes like ‘G203’. Whether these should be tagged as name= or ref= I’ll leave for someone else to figure out. For everything else: if the name was mixed English/Chinese and name:en= already existed, I just deleted the English from name=. If there was no existing name:en=, I created that tag and moved the English name into it.
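The rule for mixed names can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the tool that was actually used: the function name `split_mixed_name` is made up, tags are assumed to be a plain dict, and "Chinese" is approximated as the main CJK Unified Ideographs block. Names with no hanzi at all (English-only names, or codes like ‘G203’) are left untouched because they need a human decision.

```python
import re

# Approximation: hanzi = characters in the main CJK Unified Ideographs block.
HANZI = re.compile(r"[\u4e00-\u9fff]+")
LATIN = re.compile(r"[A-Za-z]")


def split_mixed_name(tags):
    """Return a copy of tags with Latin text moved out of name= (hypothetical helper)."""
    name = tags.get("name", "")
    # Only act on names that contain both hanzi and Latin letters.
    if not (HANZI.search(name) and LATIN.search(name)):
        return dict(tags)

    chinese = "".join(HANZI.findall(name))
    # Whatever is left after removing the hanzi, with whitespace normalised.
    english = " ".join(HANZI.sub(" ", name).split())

    cleaned = dict(tags)
    cleaned["name"] = chinese
    # Keep an existing name:en= as it is; only create the tag when it is missing.
    cleaned.setdefault("name:en", english)
    return cleaned


print(split_mixed_name({"highway": "residential", "name": "文艺路 Wenyi Road"}))
```

Anything this produces would still want a manual look before uploading, since a name like ‘G203国道’ would also be split by this simple rule.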
Problem 3 is also fairly easy to spot:
[out:json][timeout:25]; way["highway"]["name:en"~"Lu$"]({{bbox}}); out center;
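Once such ways are found, the obvious fix is to swap the trailing ‘Lu’ for ‘Road’. A minimal Python sketch of that substitution follows; the helper name `lu_to_road` is hypothetical, and it assumes ‘Lu’ appears as a separate final word. Compound endings (for example a direction word followed by ‘Lu’) would only be half translated, so the output needs checking by hand.

```python
import re

# A standalone final word "Lu"; the lookbehind skips longer words that merely end in "Lu".
TRAILING_LU = re.compile(r"(?<= )Lu$")


def lu_to_road(name_en):
    """Replace a final standalone 'Lu' with 'Road' (hypothetical helper)."""
    return TRAILING_LU.sub("Road", name_en)


print(lu_to_road("Wenyi Lu"))
```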
Problem 1 required several queries (there’s probably a way to search for them all in one go, but I just did them one by one this time). For example:
[out:json][timeout:25]; way["highway"]["name:en"~"W\\."]({{bbox}}); out center;
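This finds all roads like “W. Shuncheng Road”. The full stop has to be escaped with two backslashes: one so that the regex treats it as a literal dot, and another to escape that backslash inside the quoted Overpass string. (It also turns out that the markdown for OSM diary entries requires you to escape backslashes.)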
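One way the separate abbreviation searches might be folded into a single query is a regex alternation. The Python sketch below builds such a pattern and embeds it in a query of the same shape as the ones above. The abbreviation list and both function names are assumptions for illustration, and the pattern is only checked here with Python’s `re`; Overpass uses its own regex engine, which as far as I know behaves the same for a pattern this simple, but that is worth confirming in Overpass Turbo.

```python
import re

# Hypothetical list of abbreviations to look for; extend as needed.
ABBREVIATIONS = ["N", "S", "E", "W", "Str", "Rd", "Ave"]


def abbreviation_regex(abbreviations):
    """Build one regex matching any listed abbreviation followed by a full stop."""
    alternatives = "|".join(re.escape(a) for a in abbreviations)
    # The abbreviation must start the name or follow a space.
    return rf"(^| )({alternatives})\."


def overpass_query(regex):
    """Embed the regex in a query, doubling backslashes for the quoted string."""
    escaped = regex.replace("\\", "\\\\")
    return (
        "[out:json][timeout:25]; "
        f'way["highway"]["name:en"~"{escaped}"]({{{{bbox}}}}); out center;'
    )


pattern = abbreviation_regex(ABBREVIATIONS)
print(overpass_query(pattern))
```

Anchoring each abbreviation to the start of the name or a preceding space keeps ordinary words that happen to end in one of these letters from matching.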