GSoC'23: Final Report
Posted by miku0 on 27 August 2023 in English. Last updated on 29 August 2023.
Hi everyone,
I am thrilled to present the final update on my project focused on enhancing the search experience in Japan. For reference, you can find the previous blog entry here and the interim report here. In the interim report, I introduced the changes related to the sanitizer; the associated pull requests can be accessed here. The pull request for the change described in this article can be found here.
In this final update, I will provide detailed insights into our efforts in the “Address Search” realm, specifically related to the tokenizer.
Enhancements in Address Search Methods in Japanese
Japanese addresses are written as one long string without spaces, so Nominatim needs help to find the individual words. For example, consider the Japanese address “東京都新宿区西新宿2丁目8−1”. Although it contains no spaces, it can be divided into internal segments such as “東京都” (similar to a state or district), “新宿区” (akin to a city), “西新宿” (akin to a town), and “2丁目” (a block). However, Nominatim only divides such addresses using ICU (International Components for Unicode) transliteration, not based on this semantic division. Fig. 1 illustrates a debugging example with multiple potential candidates.
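To make the idea of semantic division concrete, here is a minimal sketch in Python that splits the example address into the four segments described above. It is only an illustration and not the code used in Nominatim: the pattern, the group names and the function `split_address` are all hypothetical and assumed for this example.

```python
import re

# Hypothetical pattern, for illustration only (not Nominatim's actual rules).
# It looks for a prefecture suffix, a municipality suffix, a town name,
# and a block number ending in "丁目".
ADDRESS_PATTERN = re.compile(
    r"(?P<prefecture>.+?[都道府県])"
    r"(?P<municipality>.+?[市区町村])"
    r"(?P<town>.+?)"
    r"(?P<block>[0-90-9一二三四五六七八九十]+丁目)"
)


def split_address(address):
    """Return the semantic segments of an address, or the input unchanged
    (as a one-element list) when the pattern does not match."""
    match = ADDRESS_PATTERN.match(address)
    if match is None:
        return [address]
    names = ("prefecture", "municipality", "town", "block")
    return [match.group(name) for name in names]


print(split_address("東京都新宿区西新宿2丁目8−1"))
```

A pattern this simple is easy to fool. For example, an address starting with “京都府” would be cut after “京都”, because “都” is also a prefecture suffix. A real implementation needs more careful rules than this sketch.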