pick/ideas.txt at main · micans/pick · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101

  -  compile before reading the header line; infer a number of checks that are
     tested after reading the header line, currently also in compile_compute.
     - table dictionary load immediately before compilation so that column/dictionary names are known.
     - other dictionaries load after compilation is successful
     - during compile, accumulate column names expected to be present.
     - maybe -x and adjacent logic tricky.
     - during compile, column names are not yet associated with offsets, so placeholder/postpone.
     - pluk: ::foo,i  ,u ,f  -> validate or transform

  ?  an @criterion that stops processing after true/false transition.

  ?  in comparisons allow zero-arg primivites such as rowno.

  /  SAM support. Document/tighten internal conventions.
     Use '*' for "no query sequence" rather than empty string? Only externally, use empty string internally?

     $ samtools view tst.vs.bam | head -n 30 | pick --sam ::,aln_edit
     -- 30 times a reference sequence length lookup failed
     be precise about when &sam_reflen is needed.

     luspuv could be done by direct addressing rather than hash lookup, but urggh work. Maybe when finecombing or overhauling.

  -  automate operator descriptions in README.md

  -  discard when protection was needed -- currently this bypasses selection and assert. not great.

  -  quants: for very large data, use heap to find data separating values.
     warn at compile time if dictionary does not exist.

  -  dord currently prevented from loading if not used, via on the nose check.

  -  does debug need stacky?

  -  dord only available with fdict kdict and cdict

  -  limit amount of errors output by hash key (errm, alert): further errors supressed.

  -  way to set not-found sort order for ,dord operator. (currently INT_MAX).

  ?  make patchiness work with --sam-aln-context=posnum.
  -  patchiness/patchiness2:
      patchiness2 improves by not increasing the current window size for insertions. test a bit more.

  -  can operator selection for suv lookup intercept be done by introspection of the subs without incurring B::Deparse overhead.

  ?  @@ and @ selections allowed among command line arguments, to aid purpose self-explanation.

  -  use of int() in source; it truncates towards zero. 1.9999-like cases might be unwelcome.
     For /any/ etc we need abs(int()), so depending on perl needs deep dive.

  ?   x y v T test4 for om/ep family.

  ?  protect against log(-1) etc

  ?  --adict-DICT=longcolumnname:lcn,anotherlongcolumnname:alcn
      (alias dict)

  ?  refer to a list of things by name (variable)
  ?  'find a column (name?value?) that passes test'
  -> tabulate fields absent (none "" / NA / NaN / -)

  /  split & select? ^foo,bar,zut^,^2,splitget -> unpack, ^0:1:3-5,up_get
  \  elaborate pack/unpack support. ^3,^%01,nspack  ^%01,spackall  val^%01,unspack

  ?  Some way to reorder or swap columns by name

---

  -  challenge: combine ed and map. replace substring with its map. First ,get it and ,map, then ,ed it.
        currently: echo -e "_a_\t3\n_b_\t4\naba\t5" | pick -AiK --cdict-foo=a:Alpha,b:Beta x::1^_'(.)'_,get^foo,map 1::1^'_\K(.)(?=_)':x,ed
        where \K is available since Perl 5.10.0, 2007. Requires some perl, doesn't look great, but works.
     ?  edmap => [ 3, sub { $::STACK[-3] =~ s/$::STACK[-2]/my $x = $::dict{$::STACK[-1]}{$1}; defined($x) ? $x : $1/e  } ],
        echo -e "_a_\t3\n_b_\t4\n_d_\t8" | pick -AviK --fdict-foo=dict --cdict-foo=a:Alpha,b:Beta,c:Gamma 1::1^'%5E_\K(.)(?=_$)'^foo,edmap
     -> $1 not the right control? You might want to replace more than just $1, using the map of $1.
        the two-step solution will allow this.

  ?  get: option to return field if no match? -> covered by ed; get allows filtering on empty string with uie
  ?  predefine constants (e.g. log10, pi). new syntax, e.g. ^^PI ^^LOG10 ^^E ^^PHI - no real need.
  ?  implement in C or Rust (use pcre2). string/float/int the main point of pain.
  #  -T pushes column; slightly inelegant but inevitable. -N -L fixed as rowno lineno operators.

  #  _ in alignment string (dangling sequence) can refer to both reference and query. Let's make ` for reference.

  x  similar to --sam also --fastq (columns 1 2 3 4), --fasta (1 2) ?  unlikely, use bioawk for that.
     note --fasta-dict-NAME= and --fastq-dict-NAME= to make sequences available.

  x  Alternative implementation to pstore: push header with $_ . '_ref', push @F with with @F_ref.
     Then -A requires @header_orig as one kind of fiddliness. Also $::N, $B_printall
     Better to write a separate utility for this kind of self-join.

  x  numerical comparisons check field is numeric; however duplicates perl's work/warnings

  x  --group-all -> do not skip first row (re-populate %::pstore_cache from %::pstore_init)
     problematic that repopulation only happens after compute finished. This is fine if first
     row is skipped, but pretty impossible for the first row. Current status: overreach.

  !  (pluk) column type can be inferred from operations and comparisons at compile time.
     string by default, numeric as needed.
     large integers in rust?
     integer versions of numeric operations. imul isub iadd