miku0's Diary

Recent diary entries

GSoC'23: Final Report

Posted by miku0 on 27 August 2023 in English. Last updated on 29 August 2023.

Hi everyone,

I am thrilled to present the final update on my project to enhance the search experience in Japan. For reference, you can find the previous blog entry here and the interim report here. In the interim report, I introduced the changes to the sanitizer; the associated pull requests can be accessed here. The pull request for the change described in this article is here.

In this final update, I will describe in detail our work on address search, specifically the changes to the tokenizer.

Enhancements in Address Search Methods in Japanese

Japanese addresses are written as one continuous string without spaces, so Nominatim needs help finding the individual words. Consider the Japanese address “東京都新宿区西新宿２丁目８−１”. Although it contains no spaces, it can be divided into segments such as “東京都” (similar to a state or district), “新宿区” (akin to a city), “西新宿” (akin to a town), and “２丁目” (a block). However, Nominatim currently splits such addresses only by using ICU (International Components for Unicode) transliteration, not according to this semantic division. Fig.1 illustrates a debugging example with multiple potential candidates.
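
To make the idea of semantic division concrete, here is a minimal Python sketch that splits an address on common administrative suffixes. It is not the tokenizer change from the pull request: the function name `split_japanese_address` and the regular expression are assumptions made for illustration, and the pattern is deliberately naive (for example, it ignores 郡 and will mis-split town names that contain 市, 区, 町, or 村).

```python
import re

# Hypothetical pattern, for illustration only. It is NOT Nominatim's logic.
ADDRESS_PATTERN = re.compile(
    r"(?P<prefecture>東京都|北海道|(?:京都|大阪)府|.{2,3}県)"  # 47 prefectures
    r"(?P<municipality>.+?[市区町村])"                        # city / ward / town / village
    r"(?P<town>.+?)"                                          # town name, e.g. 西新宿
    r"(?P<block>[\d一二三四五六七八九十]+丁目)"               # block, e.g. ２丁目
    r"(?P<rest>.*)"                                           # remaining numbers, if any
)


def split_japanese_address(address):
    """Return the semantic segments of an address, or [address] if it does not fit."""
    match = ADDRESS_PATTERN.fullmatch(address)
    if match is None:
        return [address]
    # Drop empty segments (the trailing part may be missing).
    return [part for part in match.groups() if part]


print(split_japanese_address("東京都新宿区西新宿２丁目８−１"))
```

A real implementation would need a much more careful treatment of place names; the point here is only that the segments named above can be recovered without any spaces in the input.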


GSoC'23: Midterm Update

Posted by miku0 on 13 July 2023 in English. Last updated on 17 July 2023.

Hi everyone!

This is a blog post providing an update on the progress of my project, which aims to improve the search experience in Japan. My project is 12 weeks long, and we are currently in the 7th week.

How did we approach it?

Based on our research, we identified two key areas for enhancing the search experience. These two aspects are interconnected, and we are working on resolving them simultaneously.

Searching for addresses: Currently, Nominatim focuses on the addr:street and addr:place tags when searching for addresses. However, in Japan, addresses are primarily based on block addresses, and the street component is less significant. Therefore, we need to ensure that Nominatim can appropriately assign the correct parent when conducting searches.

Importing data: Nominatim generates a database from an OSM planet file. To accommodate the block address system in Japan, we are adding a new sanitization function to adjust Japanese addresses. This will ensure that the data generated from the OSM planet file aligns with the block address structure, similar to the changes made on the searching side.
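
The import-side adjustment can be pictured as a small tag rewrite. The following Python sketch is not the sanitizer from the project's pull requests and does not use Nominatim's sanitizer interface: the function name `to_block_address`, the `addr:quarter` key, and the hyphen used to join the numbers are all assumptions made for illustration.

```python
def to_block_address(tags):
    """Fold block-address tags into addr:place and addr:housenumber.

    Hypothetical helper working on a plain dict of OSM tags.
    The input dict is left unchanged; a rewritten copy is returned.
    """
    result = dict(tags)

    # Assumption: block number and house number together form the house number.
    block = result.pop("addr:block_number", None)
    house = result.pop("addr:housenumber", None)
    numbers = [value for value in (block, house) if value]
    if numbers:
        result["addr:housenumber"] = "-".join(numbers)

    # Assumption: quarter and neighbourhood take the role of addr:place,
    # but an explicit addr:place is never overwritten.
    if "addr:place" not in result:
        area = [result.pop(key)
                for key in ("addr:quarter", "addr:neighbourhood")
                if key in result]
        if area:
            result["addr:place"] = "".join(area)

    return result


example = {
    "addr:quarter": "西新宿",
    "addr:neighbourhood": "２丁目",
    "addr:block_number": "8",
    "addr:housenumber": "1",
}
print(to_block_address(example))
```

With a rewrite along these lines, an object tagged only with block-address components ends up with the place and house number keys that the search side already understands.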

Progress Update


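Greetings from a GSoC'23 Participant

Posted by miku0 on 30 May 2023 in Japanese (日本語).

Greetings!

Hello everyone! I am Miku, a graduate student in Japan. Through GSoC, I plan to improve the address search algorithm for Japan. I am very excited to have this opportunity.

About my project

As you may know, Japanese addresses are unique: unlike the address schemes common in the rest of the world, they are based on block addresses. However, Nominatim, OSM's address search engine, does not support this Japan-specific address system, which makes it difficult to search for Japanese addresses correctly. My GSoC project tackles this problem by adding support for Japanese addresses to the search algorithm. You can find how these features will be added at this link.

Goals of my project

Develop a sanitizer that can properly set tags for Japan-specific components such as housenumber, block_number, and neighbourhood
Implement a tokenizer that can generate appropriate formats based on the Japanese address structure
(Optional) Add functionality to the tokenizer to distinguish between Chinese and Japanese kanji

Thank you for reading. This problem is rooted in a system that is very specific to Japan. If you have any comments, I would be grateful to hear them.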

Greeting from GSoC'23!

Posted by miku0 on 30 May 2023 in English.

Hello!

Hi everyone!

My name is Miku, and I’m currently a master’s student at Tokyo University of Science in Japan. In my free time, I enjoy watching anime and reading books. I’m incredibly excited about the opportunity to contribute to the OSM community!

What will I do?

My project involves improving the search experience in Japan, which has a unique address system. Nominatim, OSM's geographic search engine, relies on the ‘addr:street’ and ‘addr:place’ tags. However, the Japanese address system does not map onto these categories, so searching for addresses in Japan is challenging. My primary focus will be on solving this problem for the Japanese block number address system by developing a sanitizer and a tokenizer.

Here are the milestones for my project:

Develop a sanitizer capable of properly setting tags for Japan-specific components such as housenumber, block_number, and neighbourhood
Implement a tokenizer that can generate appropriate formats based on the Japanese address structure
(Optional) Add functionality to the tokenizer to differentiate between Chinese characters and Japanese characters

Thank you for taking the time to read my introduction! I am eager to collaborate with everyone in this community. These problems are closely related to Japanese localization, and this is my Japanese introductory page. If you have any questions, please feel free to ask them in the comments!