
Commit a30a485

Intern UAs on perf script intake
The perf scripts do multiple passes over the input[^1], so they need the entire input in memory. However, they don't need to hold every line in memory individually: UA logs tend to be pretty redundant (80 to 95% duplicate lines depending on the site), and the strings are the vast majority of the payload (they average 100~150 bytes each). We can memoize the inputs in order to dedup them. While that's *usually* not a good idea, since the memoized content has to live for the entire program lifetime, here we can abuse `sys.intern` for it: it minimizes the amount of change necessary, and Python GCs interned strings anyway.

This reduces the memory consumption of the UAs list by an order of magnitude or so[^2], which is *very* significant for large logs. Note however that the Belady simulator remains heinously costly: running hitrates on the 174M-UA "sample 2" dataset, memory use falls by 10~15GB when the Belady sim completes. It might be a good idea to find out which of its collections is the source of the problem and see whether it can be improved upon.

This also makes hitrates significantly faster on sample 2, likely as a combination of two factors (though that has not been confirmed in any way, so YMMV):

- lower memory pressure / cache thrashing from having to trawl less memory
- much more efficient dict hits (a pointer comparison is sufficient to validate a key after the hashcode check), especially combined with sample 2 having significantly higher hit rates than sample 1 (dailymotion), as a cache hit is a dict hit first (though there are costs associated with metadata maintenance afterwards)

[^1]: Technically they could do just one pass by interleaving all the parser configurations, but currently that's not the case. I also worry that this would affect CPU-level prediction, although since this is Python it's probably not much of a worry, and UA parsing is a pretty unpredictable workload anyway...

[^2]: UA strings are 100~150 bytes on average; dedup'ing them means storing an 8-byte pointer per line, plus the UA's length amortized over its duplicates, which averages out to a handful of bytes per line.
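As a rough illustration (not part of the commit), the sketch below simulates a redundant UA log and compares the number of distinct string objects, and the bytes they occupy, with and without interning on intake. The log contents, the 90% redundancy figure, and the helper names are all made up for illustration; the point is just that `sys.intern` collapses equal lines to a single object, so the list only pays for pointers plus one copy of each unique UA.

```python
import sys

def fake_log():
    # hypothetical, highly redundant "log": 100_000 lines over 10_000 unique
    # UAs (~90% duplicates), loosely matching the redundancy described above
    for i in range(100_000):
        yield "Mozilla/5.0 (compatible; agent-%d; +http://example.com/bot)" % (i % 10_000)

naive = list(fake_log())                      # every line is a separate str object
interned = list(map(sys.intern, fake_log()))  # duplicates collapse to one object each

def string_payload(lines):
    # bytes spent on the str objects themselves, counting each object once;
    # the list of 8-byte pointers costs the same either way
    return sum(sys.getsizeof(s) for s in {id(s): s for s in lines}.values())

print(len({id(s) for s in naive}), "str objects without interning")    # ~100000
print(len({id(s) for s in interned}), "str objects with interning")    # 10000
print("string payload shrinks by ~%dx" % (string_payload(naive) // string_payload(interned)))
```

At ~90% redundancy the payload reduction is roughly the order of magnitude claimed above; more redundant logs shrink proportionally more.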
1 parent f225dd1 commit a30a485

1 file changed

Lines changed: 4 additions & 4 deletions


src/ua_parser/__main__.py

```diff
@@ -98,7 +98,7 @@ def get_rules(parsers: List[str], regexes: Optional[io.IOBase]) -> Matchers:
 
 
 def run_stdout(args: argparse.Namespace) -> None:
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     count = len(lines)
     uniques = len(set(lines))
     print(f"{args.file.name}: {count} lines, {uniques} unique ({uniques / count:.0%})")
@@ -131,7 +131,7 @@ def run_stdout(args: argparse.Namespace) -> None:
 
 
 def run_csv(args: argparse.Namespace) -> None:
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     LEN = len(lines) * 1000
     rules = get_rules(args.bases, args.regexes)
 
@@ -288,7 +288,7 @@ def __call__(self, ua: str, domains: Domain, /) -> PartialResult:
             self.count += 1
             return r
 
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     total = len(lines)
     uniques = len(set(lines))
     print(total, "lines", uniques, "uniques")
@@ -343,7 +343,7 @@ def worker(
 
 
 def run_threaded(args: argparse.Namespace) -> None:
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     basic = BasicResolver(load_builtins())
     resolvers: List[Tuple[str, Resolver]] = [
         ("locking-lru", CachingResolver(basic, caching.Lru(CACHESIZE))),
```
