Hi TN date class accuracy improvement#418

Open
shrpawar-alt wants to merge 12 commits into NVIDIA:staging/hi_tn_v3 from shrpawar-alt:hi-tn-date-v2

@shrpawar-alt
Contributor

What does this PR do ?

Improves Date class accuracy from ~87% to ~99% by adding graph coverage for the cases that previously failed.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@shrpawar-alt marked this pull request as ready for review April 21, 2026 11:16
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
@mgrafu changed the base branch from main to staging/hi_tn_v3 April 28, 2026 19:44
Comment thread nemo_text_processing/text_normalization/hi/data/date/days.tsv Outdated
Comment thread nemo_text_processing/text_normalization/hi/data/date/months.tsv Outdated
Comment thread nemo_text_processing/text_normalization/hi/data/date/months.tsv Outdated
Comment thread nemo_text_processing/text_normalization/hi/data/date/unambiguous_days.tsv Outdated
Comment thread nemo_text_processing/text_normalization/hi/taggers/date.py Outdated
Comment thread nemo_text_processing/text_normalization/hi/taggers/date.py Outdated
Comment thread nemo_text_processing/text_normalization/hi/taggers/date.py Outdated
Comment thread nemo_text_processing/text_normalization/hi/taggers/date.py Outdated
Comment thread nemo_text_processing/text_normalization/hi/taggers/date.py Outdated
Comment thread nemo_text_processing/text_normalization/hi/taggers/date.py Outdated
shrpawar-alt and others added 7 commits May 4, 2026 08:01
…ion of more test cases

Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
…mm-yyyy, dd-m-yyyy and mm-yyyy

Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>

# Create union of suffixes and prefixes
suffix_union = pynini.union(*suffixes_list)
prefix_union = pynini.union(*prefixes_list)
Collaborator

Can line 36 through here be replaced by a string file?

Contributor Author

Thanks for the suggestion! I explored pynini.string_file() for both unions. It works for suffix_union if we add identity columns to suffixes.tsv, because the suffix comes after graph_year and does not affect which year graph path is selected. For prefix_union, however, string_file() caused graph_year_thousands to incorrectly win over graph_year_hundreds_as_thousands for years such as 1999, 1920, and 1971, even with identity columns. prefix_union comes before graph_year in the concatenation, and the identity transducer from string_file() appears to alter the weight landscape at that point.
I considered adjusting weights to compensate, but graph_year_thousands and graph_year_hundreds_as_thousands are designed to be mutually exclusive, and it was unclear why string_file() broke that exclusivity, which made weight tuning risky. To keep both unions consistent and avoid that risk, I now use pynini.string_map() for both. It eliminates the intermediate list variables and the verbose open() + pynini.union() block, handles single-column entries correctly, and preserves the expected FST behavior.
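
To make the expected behavior concrete, here is a minimal pure-Python sketch of the mutual exclusivity described above. It is not the pynini grammar from this PR. The selection rule (use the hundreds-style reading when the hundreds digit is non-zero) is an assumption inferred from the test cases quoted in this thread, and `year_reading` is a hypothetical name.

```python
def year_reading(year: int) -> tuple:
    """Pick exactly one reading style for a four-digit year.

    Assumption (inferred from test cases in this thread, not taken from the
    grammar): a non-zero hundreds digit selects the "hundreds" style,
    e.g. 1970 -> (19 hundred, 70); otherwise the "thousands" style,
    e.g. 2020 -> (2 thousand, 20).
    """
    if not 1000 <= year <= 9999:
        raise ValueError("expected a four-digit year")
    hundreds_digit = (year // 100) % 10
    if hundreds_digit != 0:
        # Hundreds style: leading two digits, then the last two digits.
        return ("hundreds", year // 100, year % 100)
    # Thousands style: leading digit, then the last two digits.
    return ("thousands", year // 1000, year % 100)
```

Under this assumed rule, 1999, 1920, and 1971 all take the hundreds style, which is the outcome the string_file() variant of prefix_union reportedly broke.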

०४-०३~चार मार्च
25-03-2020~पच्चीस मार्च दो हज़ार बीस
३०-०५-२०७०~तीस मई दो हज़ार सत्तर
12-07-1970~बारह जुलाई उन्नीस सौ सत्तर
Collaborator

why delete this?

Contributor Author

This test case has been restored.

12-07-1970~बारह जुलाई उन्नीस सौ सत्तर
०९-१२-२१०१~नौ दिसंबर इक्कीस सौ एक
23-08-2024~तेईस अगस्त दो हज़ार चौबीस
१०-२९-२०००~अक्टूबर उनतीस दो हज़ार
Collaborator

why delete these?

Contributor Author

@shrpawar-alt May 13, 2026

१०-२९-२०००, 11-14-1100 (MM-DD-YYYY cases):
These were removed following your earlier feedback on MM-DD: "if we can't normalize all, does it really make sense to cover for some?" The same applies to MM-DD-YYYY: since we can only handle cases where the day is unambiguously > 12, the coverage is partial. To stay consistent with that decision, we removed MM-DD-YYYY support as well.
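
The partial-coverage problem can be illustrated with a minimal sketch. It is not the grammar from this PR: `classify_numeric_date` is a hypothetical helper, and it assumes a `-` separator and exactly three numeric fields. It only shows that an `a-b-yyyy` string can be identified as MM-DD-YYYY when the second field exceeds 12.

```python
def classify_numeric_date(text: str) -> str:
    """Classify an 'a-b-yyyy' string by which field can be the month.

    Hypothetical helper for illustration only; not part of the PR's grammar.
    int() accepts Unicode decimal digits, so Devanagari digits parse as well.
    """
    first, second, _year = (int(part) for part in text.split("-"))
    first_can_be_month = 1 <= first <= 12
    second_can_be_month = 1 <= second <= 12
    first_can_be_day = 1 <= first <= 31
    second_can_be_day = 1 <= second <= 31

    dd_mm = first_can_be_day and second_can_be_month
    mm_dd = first_can_be_month and second_can_be_day
    if dd_mm and mm_dd:
        # Both fields are <= 12, so the digits alone cannot decide the order.
        return "ambiguous"
    if dd_mm:
        return "DD-MM-YYYY"
    if mm_dd:
        return "MM-DD-YYYY"
    return "invalid"
```

An input such as 12-07-1970 comes back as "ambiguous" under this check, so a grammar supporting both orders could claim MM-DD-YYYY only for inputs like 10-29-2000 or 11-14-1100.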

०३-२०१०, 11-2024 (MM-YYYY cases):
These were removed because MM-YYYY is not a standard date format and is highly ambiguous: it cannot be reliably distinguished from DD-YYYY. For example, 03-2010 could just as well be read as day 3 of the year 2010, or as a range from three to two thousand ten. Since we cannot resolve this ambiguity confidently, keeping these cases would risk incorrect normalizations.
