@fondoger fondoger commented Mar 30, 2025

Currently, the text normalization algorithm simply replaces the original text with the normalized text. As a result, the words in the generated timestamps do not align with the original text.

Kokoro supports embedding phonemes in the text, and the token timestamps are based on the original text.

  • Original Input Text: [Misaki](/misˈɑki/) is a G2P engine designed for [Kokoro](/kˈOkəɹO/) models.
  • Text For Timestamps: Misaki is a G2P engine designed for Kokoro models.
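To make the relationship between the two lines above concrete, here is a minimal sketch that recovers the "Text For Timestamps" form from the "Original Input Text" form. It only assumes the `[text](/phonemes/)` syntax shown in the example; the regex and the function name `strip_embedded_phonemes` are illustrative and are not taken from Kokoro or from this PR.

```python
import re

# Matches one [visible text](/phonemes/) span, as in the example above.
# Hypothetical pattern: it does not handle nested brackets.
_PHONEME_SPAN = re.compile(r"\[([^\[\]]+)\]\(/([^/()]+)/\)")


def strip_embedded_phonemes(text: str) -> str:
    """Replace every [text](/phonemes/) span with its visible text."""
    return _PHONEME_SPAN.sub(r"\1", text)


original = (
    "[Misaki](/misˈɑki/) is a G2P engine designed for "
    "[Kokoro](/kˈOkəɹO/) models."
)
print(strip_embedded_phonemes(original))
# Misaki is a G2P engine designed for Kokoro models.
```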

Before this PR:

Text:  The price will be $100 after 9:30PM.
word    start_time      end_time
The     0.0005416666666666625   0.07554166666666667
price   0.07554166666666667     0.3880416666666666
will    0.3880416666666666      0.4880416666666667
be      0.4880416666666667      0.6380416666666666
one     0.6380416666666666      0.8255416666666666
hundred 0.8255416666666666      1.1255416666666667
dollars 1.1255416666666667      1.8505416666666668
after   1.8505416666666668      2.188041666666667
nine    2.188041666666667       2.5255416666666664
thirtyPM        2.5255416666666664      3.5255416666666664
.       3.5255416666666664      3.6755416666666667

Note that $100 is mistakenly shown as "one hundred dollars", and 9:30PM is shown as "nine thirtyPM".

After this PR:

Text:  The price will be $100 after 9:30PM.
word    start_time      end_time
The     0.0005416666666666625   0.07554166666666667
price   0.07554166666666667     0.3880416666666666
will    0.3880416666666666      0.4880416666666667
be      0.4880416666666667      0.6380416666666666
$100    0.6380416666666666      1.8505416666666668
after   1.8505416666666668      2.188041666666667
9:30PM  2.188041666666667       3.5255416666666664
.       3.5255416666666664      3.6755416666666667

Note that both $100 and 9:30PM are shown correctly now.
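The two tables are related in a simple way: the span reported for `$100` after this PR starts where "one" started and ends where "dollars" ended. The sketch below only illustrates that relationship with the numbers from the tables. The `WordTimestamp` class and `merge_span` helper are hypothetical and do not describe how the PR computes the timestamps internally.

```python
from dataclasses import dataclass


@dataclass
class WordTimestamp:
    word: str
    start_time: float
    end_time: float


def merge_span(original: str, parts: list[WordTimestamp]) -> WordTimestamp:
    """Collapse the timestamps of several normalized words into one entry
    labelled with the original token."""
    return WordTimestamp(original, parts[0].start_time, parts[-1].end_time)


# Rows copied from the "Before this PR" table.
parts = [
    WordTimestamp("one", 0.6380416666666666, 0.8255416666666666),
    WordTimestamp("hundred", 0.8255416666666666, 1.1255416666666667),
    WordTimestamp("dollars", 1.1255416666666667, 1.8505416666666668),
]

merged = merge_span("$100", parts)
print(merged.word, merged.start_time, merged.end_time)
```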

Author

fondoger commented Mar 30, 2025

@remsky, @fireblade2534 Please review this PR. I tested it locally and the result is good.

@fireblade2534
Collaborator

I can't test it right now, but I'll test it tomorrow.

@fireblade2534 fireblade2534 left a comment

This PR looks great in concept, but there are a few problematic inputs that I want to highlight:

  • Running on localhost:7860 -> Running on [localhost:[7860](/sˈɛvənti ˈeɪt sˈɪksti/)](/lˈoʊkɐlhˌoʊst kˈoʊlən sˈɛvən θˈaʊzənd ˈeɪt hˈʌndɹɪd sˈɪksti/)
  • Email me at user@example.com -> Email me at [user@[example-com](/ɛɡzˈæmpəl dˈɑːt kˈɑːm/)](/jˈuːzɚɹ æɾ ɛɡzˈæmpəl dˈɑːt kˈɑːm/)
  • Oh yeah I have $500.60 in my bank account -> Oh ye'a I have [$[500.60](/fˈaɪv hˈʌndɹɪd pˈɔɪnt sˈɪks zˈiəɹoʊ/)](/fˈaɪv hˈʌndɹɪd ænd wˈʌn dˈɑːlɚz ænd sˈɪksti sˈɛnts/) in my bank account

What happens in all of these cases (and will happen in more) is that one rule normalizes, for example, localhost:7860, but because the original text is still present inside the brackets as [localhost:7860], the number normalizer then comes along and normalizes the number a second time. This is an inherent issue with the way the normalizer and your code work together. The code does already handle custom phonemes; see text_processor.py:handle_custom_phonemes and get_sentence_info.
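The failure mode can be reproduced with two toy passes. This is only a sketch under stated assumptions: `normalize_host_port`, `normalize_numbers`, and `spell_digits` are hypothetical stand-ins, not the rules in the Kokoro-FastAPI normalizer, and plain words are used in place of real phonemes.

```python
import re

DIGITS = "zero one two three four five six seven eight nine".split()


def spell_digits(number: str) -> str:
    """Spell a digit string one digit at a time (toy replacement for G2P)."""
    return " ".join(DIGITS[int(d)] for d in number)


def wrap(original: str, spoken: str) -> str:
    """Keep the original text and attach its pronunciation."""
    return f"[{original}](/{spoken}/)"


def normalize_host_port(text: str) -> str:
    # First pass: wraps the whole "localhost:PORT" token.
    return re.sub(
        r"\b(localhost):(\d+)\b",
        lambda m: wrap(m.group(0), f"{m.group(1)} colon {spell_digits(m.group(2))}"),
        text,
    )


def normalize_numbers(text: str) -> str:
    # Second pass: wraps every digit run, including ones that an earlier
    # pass already left inside square brackets.
    return re.sub(r"\d+", lambda m: wrap(m.group(0), spell_digits(m.group(0))), text)


step1 = normalize_host_port("Running on localhost:7860")
step2 = normalize_numbers(step1)
print(step1)
print(step2)
```

The second line printed has the same nested shape as the outputs listed above, because the second pass cannot tell that `7860` has already been handled.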

Author

fondoger commented Apr 1, 2025

Thanks for the review. I'll check if I can think of better solutions to handle these cases.

Author

fondoger commented Apr 1, 2025

I just found out that the original Kokoro can already handle some basic normalization by itself.

Try it here: https://hexgrad-kokoro-tts.hf.space

  • Email me at user@example.com -> ˈimˌAl mˌi æt jˈuzəɹ æt ɪɡzˈæmpəl dˌɑt kˈɑm
  • Oh yeah I have $500.60 in my bank account -> ˈO jˈɛə ˌI hæv fˈIv hˈʌndɹəd dˈɑləɹz ænd sˈɪksti sˈɛnts ɪn mI bˈæŋk əkˈWnt

Maybe we can simply disable normalization in Kokoro-FastAPI.

@fireblade2534
Collaborator

Disabling normalization in Kokoro-FastAPI has always been an option. The README has a section on how to do it.

@fireblade2534
Collaborator

> Thanks for the review. I'll check if I can think of better solutions to handle these cases.

I would suggest hijacking the current system for preserving custom phonemes.
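One way to read this suggestion is to hide each already-wrapped span behind a placeholder before later passes run, then restore it afterwards. The sketch below is an assumption about what that could look like, not the actual logic of `handle_custom_phonemes`. The helpers `protect`, `restore`, and `normalize_numbers` are hypothetical, and the placeholder scheme assumes the input contains no Unicode private-use characters.

```python
import re

# One [text](/phonemes/) span without nesting (hypothetical pattern).
_SPAN = re.compile(r"\[[^\[\]]+\]\(/[^/()]+/\)")
# Placeholders are single private-use characters, so they contain no digits
# or letters that a later normalization pass could match.
_BASE = 0xE000
_PLACEHOLDER = re.compile("[\uE000-\uF8FF]")
DIGITS = "zero one two three four five six seven eight nine".split()


def protect(text: str) -> tuple[str, list[str]]:
    """Swap each wrapped span for a placeholder and remember the span."""
    saved: list[str] = []

    def stash(match: re.Match) -> str:
        saved.append(match.group(0))
        return chr(_BASE + len(saved) - 1)

    return _SPAN.sub(stash, text), saved


def restore(text: str, saved: list[str]) -> str:
    """Put the remembered spans back in place of their placeholders."""
    return _PLACEHOLDER.sub(lambda m: saved[ord(m.group(0)) - _BASE], text)


def normalize_numbers(text: str) -> str:
    """Toy number pass: wrap each digit run with its spelled-out digits."""
    def spell(match: re.Match) -> str:
        spoken = " ".join(DIGITS[int(d)] for d in match.group(0))
        return f"[{match.group(0)}](/{spoken}/)"

    return re.sub(r"\d+", spell, text)


text = "Port [localhost:7860](/localhost colon seven eight six zero/) serves 2 users"
masked, saved = protect(text)
result = restore(normalize_numbers(masked), saved)
print(result)
```

With the span protected, the number pass only touches the free-standing `2` and leaves `7860` inside the existing brackets alone.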

@fondoger fondoger marked this pull request as draft April 3, 2025 06:39