Description
Hi,
I am using sudachi-rs as part of a Neovim plugin to help with learning Japanese.
I am trying to find all possible values of the part-of-speech component so that I can declare my own enums for them.
I've searched for an enum listing the lexical categories (for instance 名詞, 助詞, 補助記号; I want to know what the other possible values are), but I couldn't find one here or in the Sudachi dictionary. I eventually reached https://github.com/WorksApplications/Sudachi, but there seems to be no enum there either: is 名詞 simply part of the dictionary data? It looks like the part of speech is just a list of freeform strings. There must be some convention, though. Where can I find such a list?
My goal is to reproduce in Neovim the output of https://www3.nhk.or.jp/news/easy/ne2025073011585/ne2025073011585.html, i.e. a view where locations and people's names are highlighted differently.
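To make the goal concrete, here is a minimal sketch of the kind of mapping I have in mind. It assumes the part of speech arrives as a list of strings and that proper nouns follow a UniDic-style hierarchy (名詞 / 固有名詞, with 人名 or 地名 at the third level). Those tag strings are my assumption and would need to be checked against the dictionary; `Highlight` and `classify` are hypothetical names, not part of sudachi-rs.

```rust
/// Highlight categories a plugin might distinguish (hypothetical names).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Highlight {
    PersonName,
    PlaceName,
    OtherProperNoun,
    Plain,
}

/// Classify a part-of-speech tuple given as a slice of strings.
/// The tag strings below are assumed, not taken from an official list.
pub fn classify(pos: &[&str]) -> Highlight {
    match pos {
        // Proper noun, person name (any deeper sub-category).
        ["名詞", "固有名詞", "人名", ..] => Highlight::PersonName,
        // Proper noun, place name (any deeper sub-category).
        ["名詞", "固有名詞", "地名", ..] => Highlight::PlaceName,
        // Any other proper noun.
        ["名詞", "固有名詞", ..] => Highlight::OtherProperNoun,
        // Everything else, including empty or unknown tuples.
        _ => Highlight::Plain,
    }
}
```

Matching on string prefixes like this works without an official enum, but it silently falls through to `Plain` for any tag I don't know about, which is why a documented list of values would help.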
I wonder if the tokenizer could output JSON in addition to the current format (e.g., with --output=json)? It might not be good for performance, but JSON would self-document the various part-of-speech fields.
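As a rough illustration of what I mean, the sketch below renders one token as a JSON object. The key names (`surface`, `pos`) and the function names are hypothetical, chosen only for this example; this is not an existing sudachi-rs output format.

```rust
/// Escape a string for embedding in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one token as a JSON object; the key names are hypothetical.
pub fn token_to_json(surface: &str, pos: &[&str]) -> String {
    let tags: Vec<String> = pos
        .iter()
        .map(|p| format!("\"{}\"", json_escape(p)))
        .collect();
    format!(
        "{{\"surface\":\"{}\",\"pos\":[{}]}}",
        json_escape(surface),
        tags.join(",")
    )
}
```

One object per token, one token per line, would be easy to consume from a Neovim plugin without having to know the column order of the current output.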