
Commit a30a485

Intern UAs on perf script intake
The perf scripts do multiple passes over the input[^1], so they need the entire input in memory. However, they don't need to hold every line in memory individually: UA logs tend to be pretty redundant (80 to 95% duplicate lines depending on the site), and the strings are the vast majority of the payload (they average 100~150 bytes each). We can memoize the inputs in order to dedup them. While that's *usually* not a good idea, since the memoized content has to live for the entire program lifetime, here we can abuse `sys.intern` for it: it minimizes the amount of change necessary, and Python GCs interned strings anyway.

This reduces the memory consumption of the UAs list by an order of magnitude or so[^2], which is *very* significant for large logs. Note however that the Belady simulator remains heinously costly: running hitrates on the 174M-UA "sample 2" dataset, memory use falls by 10~15GB when the Belady sim completes. It might be a good idea to find out which of its collections is the source of the problem and see whether it can be improved upon.

This also makes hitrates significantly faster on sample 2, likely as a combination of two factors (though that has not been confirmed in any way, so YMMV):

- lower memory pressure / cache thrashing from having to trawl less memory
- much more efficient dict hits (a pointer comparison is sufficient to validate a key after the hashcode check), especially combined with sample 2 having significantly higher hit rates than sample 1 (dailymotion), as a cache hit is a dict hit first (though there are costs associated with metadata maintenance afterwards)

[^1]: Technically they could do just one pass by interleaving all the parser configurations, but currently that's not the case. I also worry that this would affect CPU-level prediction, although since this is Python it's probably not much of a worry, and UA parsing is a pretty unpredictable workload anyway...

[^2]: UA strings are 100~150 bytes on average; dedup'ing them means storing an 8-byte pointer per line, plus the UA's length amortized over its duplicates, which averages out to a handful of bytes per line.
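As a rough illustration (not part of the commit), the sketch below simulates a redundant UA log and compares the number of distinct string objects, and the bytes they occupy, with and without interning on intake. The log contents, the 90% redundancy figure, and the helper names are all made up for illustration; the point is just that `sys.intern` collapses equal lines to a single object, so the list only pays for pointers plus one copy of each unique UA.

```python
import sys

def fake_log():
    # hypothetical, highly redundant "log": 100_000 lines over 10_000 unique
    # UAs (~90% duplicates), loosely matching the redundancy described above
    for i in range(100_000):
        yield "Mozilla/5.0 (compatible; agent-%d; +http://example.com/bot)" % (i % 10_000)

naive = list(fake_log())                      # every line is a separate str object
interned = list(map(sys.intern, fake_log()))  # duplicates collapse to one object each

def string_payload(lines):
    # bytes spent on the str objects themselves, counting each object once;
    # the list of 8-byte pointers costs the same either way
    return sum(sys.getsizeof(s) for s in {id(s): s for s in lines}.values())

print(len({id(s) for s in naive}), "str objects without interning")    # ~100000
print(len({id(s) for s in interned}), "str objects with interning")    # 10000
print("string payload shrinks by ~%dx" % (string_payload(naive) // string_payload(interned)))
```

At ~90% redundancy the payload reduction is roughly the order of magnitude claimed above; more redundant logs shrink proportionally more.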
1 parent f225dd1 commit a30a485

1 file changed

Lines changed: 4 additions & 4 deletions


src/ua_parser/__main__.py

```diff
@@ -98,7 +98,7 @@ def get_rules(parsers: List[str], regexes: Optional[io.IOBase]) -> Matchers:
 
 
 def run_stdout(args: argparse.Namespace) -> None:
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     count = len(lines)
     uniques = len(set(lines))
     print(f"{args.file.name}: {count} lines, {uniques} unique ({uniques / count:.0%})")
@@ -131,7 +131,7 @@ def run_stdout(args: argparse.Namespace) -> None:
 
 
 def run_csv(args: argparse.Namespace) -> None:
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     LEN = len(lines) * 1000
     rules = get_rules(args.bases, args.regexes)
 
@@ -288,7 +288,7 @@ def __call__(self, ua: str, domains: Domain, /) -> PartialResult:
             self.count += 1
             return r
 
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     total = len(lines)
     uniques = len(set(lines))
     print(total, "lines", uniques, "uniques")
@@ -343,7 +343,7 @@ def worker(
 
 
 def run_threaded(args: argparse.Namespace) -> None:
-    lines = list(args.file)
+    lines = list(map(sys.intern, args.file))
     basic = BasicResolver(load_builtins())
     resolvers: List[Tuple[str, Resolver]] = [
         ("locking-lru", CachingResolver(basic, caching.Lru(CACHESIZE))),
```
