- compile before reading the header line; infer a number of checks that are
tested after reading the header line, currently also in compile_compute.
- table dictionary load immediately before compilation so that column/dictionary names are known.
- other dictionaries load after compilation is successful
- during compile, accumulate column names expected to be present.
- maybe -x and adjacent logic tricky.
- during compile, column names are not yet associated with offsets, so placeholder/postpone.
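A rough sketch of the placeholder/postpone idea, in Python rather than Perl and not taken from the pick source: the compile step only records which column names it saw, and offsets are bound once the header line is available. The function names compile_refs and resolve_offsets are hypothetical.

```python
def compile_refs(column_names):
    """Compile step: record expected column names in first-seen order.

    No offsets are known yet, so names act as placeholders.
    """
    expected = []
    for name in column_names:
        if name not in expected:
            expected.append(name)
    return expected


def resolve_offsets(expected, header):
    """Post-header step: bind each expected name to its column offset."""
    index = {name: i for i, name in enumerate(header)}
    missing = [name for name in expected if name not in index]
    if missing:
        # All checks that need the header can be reported together here.
        raise KeyError("columns not in header: " + ", ".join(missing))
    return {name: index[name] for name in expected}
```

Collecting the names first means every missing column can be reported in one go as soon as the header is read.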
- pluk: ::foo,i ,u ,f -> validate or transform
? an @criterion that stops processing after true/false transition.
? in comparisons allow zero-arg primitives such as rowno.
/ SAM support. Document/tighten internal conventions.
Use '*' for "no query sequence" rather than empty string? Only externally, use empty string internally?
$ samtools view tst.vs.bam | head -n 30 | pick --sam ::,aln_edit
-- 30 times a reference sequence length lookup failed
be precise about when &sam_reflen is needed.
luspuv could be done by direct addressing rather than hash lookup, but urggh work. Maybe when finecombing or overhauling.
- automate operator descriptions in README.md
- discard when protection was needed -- currently this bypasses selection and assert. not great.
- quants: for very large data, use heap to find data separating values.
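One possible shape of the heap idea, sketched in Python under the assumption that a cut point can be expressed as "the k-th smallest value": a bounded max-heap keeps only the k smallest values seen so far, so the data is streamed once and never sorted in full. kth_smallest is a hypothetical helper, not an existing pick function.

```python
import heapq


def kth_smallest(values, k):
    """Return the k-th smallest value (1-based) from an iterable in one pass.

    Keeps at most k values in memory. heapq is a min-heap, so values are
    negated to make heap[0] the largest of the k smallest seen so far.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v belongs among the k smallest; evict the current largest.
            heapq.heapreplace(heap, -v)
    if len(heap) < k:
        raise ValueError("fewer than k values")
    return -heap[0]
```

Memory is O(k), so this only pays off for cut points near one end of the data; for upper quantiles the mirrored version (k-th largest) would be used instead.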
warn at compile time if dictionary does not exist.
- dord currently prevented from loading if not used, via on the nose check.
- does debug need stacky?
- dord only available with fdict kdict and cdict
- limit number of errors output by hash key (errm, alert): further errors suppressed.
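A minimal sketch of per-key error limiting, in Python for illustration only. The class name LimitedReporter, the limit parameter, and the message format are all assumptions; errm and alert are used only as example keys.

```python
import sys
from collections import Counter


class LimitedReporter:
    """Emit at most `limit` messages per key, then one suppression notice."""

    def __init__(self, limit=3, emit=None):
        self.limit = limit
        self.counts = Counter()
        # Default sink writes to stderr; callers may pass e.g. list.append.
        self.emit = emit or (lambda line: print(line, file=sys.stderr))

    def report(self, key, message):
        """Return True if the message was emitted, False if suppressed."""
        self.counts[key] += 1
        seen = self.counts[key]
        if seen > self.limit:
            return False
        self.emit(f"[{key}] {message}")
        if seen == self.limit:
            self.emit(f"[{key}] further errors suppressed")
        return True
```

The counter keeps running after suppression, so a per-key total can still be printed at exit.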
- way to set not-found sort order for ,dord operator. (currently INT_MAX).
? make patchiness work with --sam-aln-context=posnum.
- patchiness/patchiness2:
patchiness2 improves by not increasing the current window size for insertions. test a bit more.
- can operator selection for suv lookup intercept be done by introspection of the subs without incurring B::Deparse overhead?
? @@ and @ selections allowed among command line arguments, to aid purpose self-explanation.
- use of int() in source; it truncates towards zero. 1.9999-like cases might be unwelcome.
For /any/ etc we need abs(int()), so depending on perl needs deep dive.
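To make the 1.9999-like case concrete: Python's int() also truncates toward zero, and doubles behave the same way in both languages, so the effect can be shown there. The sketch below is illustrative only; trunc_with_tolerance and its eps default are hypothetical, and an absolute tolerance like this is a crude choice for large magnitudes.

```python
import math


def trunc_with_tolerance(x, eps=1e-9):
    """Truncate toward zero, but first snap values within eps of an integer.

    Guards against results such as 7.999999999999999 that are "really" 8.
    """
    nearest = round(x)
    if abs(x - nearest) <= eps:
        return int(nearest)
    return math.trunc(x)


# (0.1 + 0.7) * 10 is slightly below 8 in double precision.
x = (0.1 + 0.7) * 10
plain = int(x)                    # truncation drops it to 7
snapped = trunc_with_tolerance(x)  # snapping recovers 8
```

Values that are genuinely fractional, such as 1.9 or -1.5, still truncate toward zero as before.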
? x y v T test4 for om/ep family.
? protect against log(-1) etc
? --adict-DICT=longcolumnname:lcn,anotherlongcolumnname:alcn
(alias dict)
? refer to a list of things by name (variable)
? 'find a column (name?value?) that passes test'
-> tabulate fields absent (none "" / NA / NaN / -)
/ split & select? ^foo,bar,zut^,^2,splitget -> unpack, ^0:1:3-5,up_get
\ elaborate pack/unpack support. ^3,^%01,nspack ^%01,spackall val^%01,unspack
? Some way to reorder or swap columns by name
---
- challenge: combine ed and map. replace substring with its map. First ,get it and ,map, then ,ed it.
currently: echo -e "_a_\t3\n_b_\t4\naba\t5" | pick -AiK --cdict-foo=a:Alpha,b:Beta x::1^_'(.)'_,get^foo,map 1::1^'_\K(.)(?=_)':x,ed
where \K is available since Perl 5.10.0, 2007. Requires some perl, doesn't look great, but works.
? edmap => [ 3, sub { $::STACK[-3] =~ s/$::STACK[-2]/my $x = $::dict{$::STACK[-1]}{$1}; defined($x) ? $x : $1/e } ],
echo -e "_a_\t3\n_b_\t4\n_d_\t8" | pick -AviK --fdict-foo=dict --cdict-foo=a:Alpha,b:Beta,c:Gamma 1::1^'%5E_\K(.)(?=_$)'^foo,edmap
-> $1 not the right control? You might want to replace more than just $1, using the map of $1.
the two-step solution will allow this.
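For comparison, the edmap idea (replace the captured substring with its dictionary value, leave it alone if there is no entry) can be sketched in Python. This is an illustration, not the pick implementation: the function name edmap is borrowed from the note above, and since Python's re module has no \K, a lookbehind stands in for it.

```python
import re


def edmap(text, pattern, mapping):
    """Replace each match of `pattern` with mapping[group 1].

    If group 1 has no entry in the mapping, the match is left unchanged.
    The pattern must match exactly the text of group 1 (use lookarounds
    for context), mirroring the \\K / lookahead trick in the Perl regex.
    """
    def substitute(match):
        key = match.group(1)
        return mapping.get(key, key)

    return re.sub(pattern, substitute, text)


cdict = {"a": "Alpha", "b": "Beta", "c": "Gamma"}
rows = ["_a_", "_b_", "_d_"]
mapped = [edmap(row, r"(?<=_)(.)(?=_)", cdict) for row in rows]
```

With the sample rows, mapped comes out as ['_Alpha_', '_Beta_', '_d_']. As with the Perl version, only the captured text is replaced; replacing a wider span based on the map of $1 still needs the two-step approach.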
? get: option to return field if no match? -> covered by ed; get allows filtering on empty string with uie
? predefine constants (e.g. log10, pi). new syntax, e.g. ^^PI ^^LOG10 ^^E ^^PHI - no real need.
? implement in C or Rust (use pcre2). string/float/int the main point of pain.
# -T pushes column; slightly inelegant but inevitable. -N -L fixed as rowno lineno operators.
# _ in alignment string (dangling sequence) can refer to both reference and query. Let's make ` for reference.
x similar to --sam also --fastq (columns 1 2 3 4), --fasta (1 2) ? unlikely, use bioawk for that.
note --fasta-dict-NAME= and --fastq-dict-NAME= to make sequences available.
x Alternative implementation to pstore: push header with $_ . '_ref', push @F with @F_ref.
Then -A requires @header_orig as one kind of fiddliness. Also $::N, $B_printall
Better to write a separate utility for this kind of self-join.
x numerical comparisons check field is numeric; however duplicates perl's work/warnings
x --group-all -> do not skip first row (re-populate %::pstore_cache from %::pstore_init)
problematic that repopulation only happens after compute finished. This is fine if first
row is skipped, but pretty impossible for the first row. Current status: overreach.
! (pluk) column type can be inferred from operations and comparisons at compile time.
string by default, numeric as needed.
large integers in rust?
integer versions of numeric operations. imul isub iadd