Hi,
Thanks for the interesting benchmark. It matches what I have been able to observe on my own. Still, I think you could add two things to it to get a broader picture, since `xan` also knows how to parallelize its computations:
- You can try `xan count --parallel` for the count bench
- You can try `xan parallel cat -P 'select <columns>'` for the select bench
It seems `zsv` does not know how to parallelize unless the file is on disk, so I suspect its approach is similar to the one used by `xan`, i.e. chunking the file cleverly in constant time ahead of a map-reduce-like process (here is how `xan` does it, in any case: https://github.com/medialab/xan/blob/master/docs/blog/csv_base_jumping.md).
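To make the chunking idea concrete, here is a minimal sketch (not `xan`'s actual code, and deliberately naive): split the file into N byte ranges in constant time, then nudge each boundary forward to the next newline so every worker starts on a record boundary. The hard part the linked article deals with, quoted fields containing newlines, is ignored here.

```python
import io

def chunk_offsets(f, size, n):
    """Split a file of `size` bytes into `n` (start, end) byte ranges,
    moving each interior boundary forward to the next newline.
    Assumes no quoted cells spanning newlines (the hard case)."""
    bounds = [size * i // n for i in range(n + 1)]
    adjusted = [0]
    for b in bounds[1:-1]:
        f.seek(b)
        f.readline()            # skip the partial record at the boundary
        adjusted.append(f.tell())
    adjusted.append(size)
    return list(zip(adjusted, adjusted[1:]))

data = b"a,b\n1,2\n3,4\n5,6\n7,8\n"
f = io.BytesIO(data)
print(chunk_offsets(f, len(data), 2))  # [(0, 12), (12, 20)]
```

Each worker can then parse its byte range independently and the results are reduced at the end, which is why the split itself must not require a full scan.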
I am curious to know whether you ever attempted to leverage AVX-512. A lot of people tout it as being even faster, but I am skeptical. `xan`'s SIMD parser is not branchless in any case, so I doubt AVX-512 would give it an edge just yet, but it might for yours?
Another thing that could be worth benchmarking is a command that requires unquoting (`xan count` uses a parser that does not even attempt to separate cells, and `xan select` uses a zero-copy, non-unquoting parser, since unquoting is not required just to shuffle columns around). I suspect your parser would be faster there as well.
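As a hedged illustration of why unquoting is the expensive case (this is not `xan`'s parser, just the RFC 4180 rule): a quoted cell containing doubled quotes cannot be returned as a zero-copy slice, because the `""` sequences must be rewritten into `"`.

```python
def unquote_field(raw: str) -> str:
    """Decode one RFC 4180 CSV cell: strip the outer quotes and
    collapse doubled quotes. Only quoted cells need a copy."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1].replace('""', '"')  # forces an allocation
    return raw  # unquoted cells can be passed through zero-copy

print(unquote_field('"he said ""hi"""'))  # he said "hi"
print(unquote_field('plain'))            # plain
```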
Also, you write in your README:

> non-4180-compliant data: zsv is fastest across the board (xan and polars are N/A for this input category)

`xan` can deal with such data too, by using a typical non-SIMD parser exposed by the `xan input` command, if you need to.
Best,