I have been working on some code to detect if a changeset is an import, SPAM, or if it has a tagging error.
https://github.com/jremillard/osm-changeset-classification
Detecting SPAM and tagging errors is pretty straightforward. However, detecting imports is much more challenging. Before I started, I thought I knew what an import was. I was looking for large changesets that only added one or two kinds of data. However, these criteria perform poorly in practice. In OSM, many import changesets are not large, and it is not uncommon for the imported data to have some hand editing mixed in.
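To make that first attempt concrete, here is a minimal sketch of the naive rule. It is not the code from the repository; the function names, the list of feature keys, and the thresholds (1000 objects, 2 kinds) are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical list of top-level keys used to bucket added objects into "kinds".
FEATURE_KEYS = ("building", "highway", "landuse", "natural", "amenity")


def kind_of(tags):
    """Return the first feature key present in an object's tags, or 'other'."""
    for key in FEATURE_KEYS:
        if key in tags:
            return key
    return "other"


def looks_like_import_naive(added_objects, min_size=1000, max_kinds=2):
    """Naive rule: a large changeset that adds only one or two kinds of data.

    added_objects is a list of tag dicts, one per object created in the
    changeset. Both thresholds are illustrative guesses, not tuned values.
    """
    if len(added_objects) < min_size:
        return False
    kinds = Counter(kind_of(tags) for tags in added_objects)
    return len(kinds) <= max_kinds
```

A rule like this misses any import that arrives as a small changeset, which is one reason it performs poorly.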
My new definition of an import
An import is any addition to OSM that directly derives from other digital map sources.
Discussion
Comment from Glassman on 17 April 2018 at 05:39
If I use the TIGER background image, provided in both iD and JOSM, to determine geometry as well as road name, is this an import?
Comment from DevonshireBoy42 on 17 April 2018 at 08:25
What do you want to do with flagged imports? If I do a small import of one village or town and manually check, conflate and edit every building then it lacks the issue that large or automated imports have.
Comment from Zverik on 17 April 2018 at 10:19
There are no imports. Import is an invented construct made by Germans to try to keep their map in check. That’s why no matter what algorithm you choose, you’d get tons of false positives and false negatives.
Comment from Stereo on 17 April 2018 at 15:07
I think it’s very interesting that the import guidelines don’t actually define what the term means.
Comment from Nakaner on 17 April 2018 at 17:24
I agree that the size alone is not helpful. I regularly check my OSMCha filters for changesets with more than 9000 additions and many of them are HOT mappers tracing buildings and uploading them after they finished editing.
jremillard wrote:
> An import is any addition to OSM that directly derives from other digital map sources.
I would append:
Otherwise people will try to define Bing imagery as a “digital map source”. :-)
However, that criterion is difficult to translate into rules a computer can apply. This is my personal list of criteria to define a bad import:
Unfortunately, our rules don’t require users to add a tag to the changeset indicating the documentation and discussion of the import. If they did, we could look for changesets which look like imports but lack those tags. I would call these tags:
import:documentation=<page title at wiki>
import:discussed:<mailing_list>=<date of first posting on imports@ mailing list>
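A check for those two tags could be sketched roughly as follows. The tag keys are the ones proposed in this comment, not an existing OSM requirement, and the function name and the plain-dict representation of changeset tags are assumptions for illustration.

```python
def missing_import_tags(changeset_tags):
    """Return the proposed import tags that a changeset does not carry.

    changeset_tags is a plain dict of changeset tag keys to values.
    The keys checked here are only the proposal from this discussion.
    """
    missing = []
    if not changeset_tags.get("import:documentation"):
        missing.append("import:documentation")
    # Any mailing list name is accepted after the "import:discussed:" prefix.
    has_discussion = any(
        key.startswith("import:discussed:") and value
        for key, value in changeset_tags.items()
    )
    if not has_discussion:
        missing.append("import:discussed:<mailing_list>")
    return missing
```

A changeset that some other detector flags as import-like and for which this returns a non-empty list would be a candidate for a closer look, not proof of an undocumented import.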
Comment from Glassman on 17 April 2018 at 18:18
@Nakaner - At a minimum having a tag: import= should be sufficient. Or even the import page url to simplify getting to the page to see details of the import.
I applaud the effort to use software to detect imports. However, we need to be careful. False positives could cause angry comments directed at an editor who did nothing wrong.
Clifford
Comment from Zverik on 17 April 2018 at 18:48
Well, this applies to all but the first four items on your list. And the fourth one is questionable.
And you are starting to discuss imports, not their detection.
Again, I am pretty sure you cannot tell a proper import from a regular edit. Regarding the source criteria, you never know what a mapper used for tracing or tagging, the same as with imports.
Comment from Nakaner on 17 April 2018 at 18:56
The fourth item (I should have written “key”, not “tag”) is an easy way to find users importing shape files. As you might know, field names of shape files are limited to 10 characters. Sometimes things go completely wrong and people end up uploading objects with uppercase keys or keys ending with ~1.
That’s not wrong. I have difficulties and write changeset comments even if I am sure. There are HOT mappers uploading thousands of buildings in one large changeset.
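The shapefile hint mentioned in this comment (uppercase keys within the 10-character field name limit, or keys ending with ~1) could be sketched like this. The function name and the exact combination of conditions are assumptions for illustration, and some legitimate OSM keys contain uppercase letters, so the result is a hint only.

```python
def shapefile_like_keys(tags):
    """Return keys that look like they came straight from a shapefile.

    tags is a dict of an object's tag keys to values. A key is flagged if it
    ends with "~1", or if it fits the 10-character shapefile field name limit
    and contains an uppercase letter.
    """
    flagged = []
    for key in tags:
        ends_with_tilde_one = key.endswith("~1")
        short_uppercase = len(key) <= 10 and any(ch.isupper() for ch in key)
        if ends_with_tilde_one or short_uppercase:
            flagged.append(key)
    return sorted(flagged)
```

For example, an object tagged with the made-up keys `BLDG_TYPE` and `name~1` would have both keys flagged, while `building` and `addr:street` would pass.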
Comment from dieterdreist on 17 April 2018 at 22:22
jremillard wrote:
An import is any addition to OSM that directly derives from other digital map sources.
I think this definition has to be extended, because you can also import other information if you are able to assign positions to it (or relate it to OSM objects).
Comment from dieterdreist on 17 April 2018 at 22:24
For me, an import is adding data from somewhere when you didn’t check every part individually.
Comment from jremillard on 18 April 2018 at 02:20
Thanks for all the comments!
@Zverik - The vast majority of imports (probably over 95%) are detectable. However, a knowledgeable person who wishes to make an import hard to detect certainly can. Obviously, it is impossible to know how often this happens.
@Stereo - I agree that the fact that the term isn’t clearly defined is interesting.
@DevonshireBoy42 - I have no plans on what to do with the detector and we will see if it goes anywhere useful.
@Glassman - Pulling road names from TIGER is a kind of import, but it doesn’t need to follow the import guidelines: we all know that it is OK because TIGER is public domain. However, pulling road names from Google isn’t OK. For small imports, we skip the import guidelines and deal with them by reverting them after the fact if they have problems.
Finally, the word “directly” would exclude tracing over an image layer.