Hi TN date class accuracy improvement by shrpawar-alt · Pull Request #418 · NVIDIA/NeMo-text-processing

shrpawar-alt · 2026-04-21T10:53:22Z

What does this PR do?

Improves Date class accuracy from ~87% to ~99% by adding graph coverage for the cases that previously failed.

Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>

for more information, see https://pre-commit.ci

…ion of more test cases Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>

…mm-yyyy, dd-m-yyyy and mm-yyyy Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>

mgrafu · 2026-05-07T18:22:04Z


-# Create union of suffixes and prefixes
 suffix_union = pynini.union(*suffixes_list)
 prefix_union = pynini.union(*prefixes_list)


Can line 36 through here be replaced by a string file?

Thanks for the suggestion! I explored pynini.string_file() for both. It works for suffix_union if we add identity columns to suffixes.tsv, because the suffix comes after graph_year and does not affect which year graph path is selected. For prefix_union, however, string_file() caused graph_year_thousands to incorrectly win over graph_year_hundreds_as_thousands for years such as 1999, 1920, and 1971, even with identity columns. prefix_union comes before graph_year in the concatenation, and the identity transducer from string_file() alters the weight landscape at that point.
I considered adjusting weights to compensate. However, graph_year_thousands and graph_year_hundreds_as_thousands are designed to be mutually exclusive, and it was unclear why string_file() was breaking that exclusivity, which made weight tuning risky. To keep both unions consistent and avoid that risk, I now use pynini.string_map() for both. It eliminates the intermediate list variables and the verbose open() + pynini.union() block while correctly handling single-column entries and preserving the expected FST behavior.
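The single-column handling mentioned above can be pictured in plain Python. This is only a sketch of the data shaping: a bare entry is treated as an identity pair and a two-column row maps its first field to its second. It does not call pynini, and the helper name `to_pairs` and the sample rows are hypothetical, not taken from the PR.

```python
from typing import Iterable, List, Tuple


def to_pairs(rows: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn TSV-style rows into (input, output) pairs.

    A single-column row becomes an identity pair; a two-column row
    maps its first field to its second.
    """
    pairs = []
    for row in rows:
        row = row.strip()
        if not row:
            continue  # skip blank lines
        fields = row.split("\t")
        source = fields[0]
        # Assumption: a missing second column means "map the entry to itself".
        target = fields[1] if len(fields) > 1 else source
        pairs.append((source, target))
    return pairs


# Hypothetical sample rows, not the contents of suffixes.tsv or prefixes.tsv.
sample_rows = ["AD", "BC\tbefore Christ", ""]
sample_pairs = to_pairs(sample_rows)
```

A list shaped like `sample_pairs` is the kind of input that would then be handed to the FST builder in place of the manual open() and union block.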

mgrafu · 2026-05-07T18:44:24Z

+०४-०३~चार मार्च
 25-03-2020~पच्चीस मार्च दो हज़ार बीस
 ३०-०५-२०७०~तीस मई दो हज़ार सत्तर
-12-07-1970~बारह जुलाई उन्नीस सौ सत्तर


why delete this?

This test case has been restored.

mgrafu · 2026-05-07T18:44:32Z

-12-07-1970~बारह जुलाई उन्नीस सौ सत्तर
 ०९-१२-२१०१~नौ दिसंबर इक्कीस सौ एक
 23-08-2024~तेईस अगस्त दो हज़ार चौबीस
-१०-२९-२०००~अक्टूबर उनतीस दो हज़ार


why delete these?

१०-२९-२०००, 11-14-1100 (MM-DD-YYYY cases):
These were removed following your earlier feedback on MM-DD: "if we can't normalize all, does it really make sense to cover for some?" The same applies to MM-DD-YYYY: since we can only handle cases where the day is unambiguously > 12, the coverage is partial. To stay consistent with that decision, we removed MM-DD-YYYY support as well.
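To illustrate why the coverage is only partial, here is a minimal plain-Python sketch (not the PR's FST grammar) that classifies the first two fields of an NN-NN-YYYY date. The function name `classify_first_two_fields` and its labels are hypothetical; the point is that the order can be determined only when exactly one field exceeds 12.

```python
def classify_first_two_fields(first: int, second: int) -> str:
    """Return which field order the first two numbers of a date allow."""
    first_is_month = 1 <= first <= 12
    second_is_month = 1 <= second <= 12
    first_is_day = 1 <= first <= 31
    second_is_day = 1 <= second <= 31

    dd_mm = first_is_day and second_is_month
    mm_dd = first_is_month and second_is_day

    if dd_mm and mm_dd:
        return "ambiguous"  # both fields are <= 12, e.g. 12-07
    if dd_mm:
        return "DD-MM"
    if mm_dd:
        return "MM-DD"  # only reachable when the day is > 12
    return "invalid"
```

With the removed test cases, `classify_first_two_fields(10, 29)` and `classify_first_two_fields(11, 14)` both resolve to MM-DD, while an input such as 12-07 stays ambiguous, so MM-DD-YYYY could never be covered in full.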

०३-२०१०, 11-2024 (MM-YYYY cases):
These were removed because MM-YYYY is not a standard date format and is highly ambiguous: it cannot be reliably distinguished from DD-YYYY. For example, 03-2010 could equally be read as day 3 of the year 2010 or as a range from three to two thousand ten. Since we cannot confidently resolve this ambiguity, keeping these cases would risk incorrect normalizations.
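The ambiguity can be made concrete with a small plain-Python sketch that lists every reading an NN-YYYY token admits. The helper `candidate_readings` and its labels are hypothetical and are not part of the PR; the checks are deliberately loose and only show that several readings survive for the same token.

```python
from typing import List


def candidate_readings(token: str) -> List[str]:
    """List the plausible readings of a two-field token such as 03-2010."""
    left, right = token.split("-")
    # int() accepts Devanagari digits as well as ASCII digits.
    a, b = int(left), int(right)

    readings = []
    if 1 <= a <= 12:
        readings.append("month-year")
    if 1 <= a <= 31:
        readings.append("day-year")
    if a < b:
        readings.append("range")
    return readings
```

For both removed test cases, 03-2010 and 11-2024, the function returns all three readings, which is why no single normalization can be chosen with confidence.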

shrpawar-alt marked this pull request as ready for review April 21, 2026 11:16

shrpawar-alt added 2 commits April 24, 2026 05:36

date class accuracy improvement

88940da

Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>

Jenkins file date update for Hi TN

017a615

Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>

shrpawar-alt force-pushed the hi-tn-date-v2 branch from 79f8c53 to 017a615 Compare April 24, 2026 08:05

[pre-commit.ci] auto fixes from pre-commit.com hooks

64fb638

for more information, see https://pre-commit.ci

mgrafu changed the base branch from main to staging/hi_tn_v3 April 28, 2026 19:44