Benchmark xan count --parallel and xan parallel cat -P 'select' #568

@Yomguithereal

Hi,

Thanks for the interesting benchmark. It matches what I have observed on my own. Still, I think you could add two things to get a broader picture, as xan also knows how to parallelize its computations:

  • You can try xan count --parallel, for the count bench
  • You can try xan parallel cat -P 'select <columns>' for the select bench

It seems zsv does not know how to parallelize unless the file is on disk, so I suspect its approach is similar to the one used by xan, i.e. chunking the file cleverly in constant time ahead of a map-reduce-like process. Is that the case? Here is how xan does it, in any case: https://github.com/medialab/xan/blob/master/docs/blog/csv_base_jumping.md
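To make the chunk-then-map-reduce idea concrete, here is a deliberately naive Python sketch. It is not xan's or zsv's actual algorithm: it assumes no quoted cell contains a line break (the hard case the linked post addresses), and the names chunk_offsets, count_lines and parallel_count are hypothetical. Finding each boundary costs one seek and one line read, so the chunking step does not depend on file size.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_offsets(path, n_chunks):
    """Return contiguous (start, end) byte ranges that each begin at a line start.

    Naive assumption: no quoted cell contains a line break.
    """
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            guess = size * i // n_chunks
            if guess <= bounds[-1]:
                continue
            f.seek(guess)
            f.readline()  # skip forward to the start of the next line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds, bounds[1:]))


def count_lines(path, start, end):
    """Map step: count the lines contained in one byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    n = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        n += 1  # last line of the file has no trailing newline
    return n


def parallel_count(path, n_chunks=4):
    """Reduce step: sum the per-chunk counts (header line included)."""
    ranges = chunk_offsets(path, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        counts = pool.map(lambda r: count_lines(path, *r), ranges)
    return sum(counts)
```

Threads are used here only to keep the sketch short; a real implementation would pick whatever parallelism primitive suits its runtime.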

I am curious to know whether you ever attempted to leverage AVX-512. A lot of people claim it can be even faster, but I am skeptical. xan's SIMD parser is not branchless in any case, so I doubt AVX-512 would give it an edge just yet, but it might for yours?

Another thing that would be good to benchmark is a command that requires unquoting: xan count uses a parser that does not even attempt to separate cells, and xan select uses a zero-copy, non-unquoting parser, since unquoting is not required to shuffle columns around. I suspect your parser would be faster there too.
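To illustrate why these workloads differ, here is a small Python sketch. It is an illustration only, not the parser of either tool, and count_records and unquote are hypothetical names. Counting records only needs to track whether the reader is inside quotes, whereas unquoting has to build a new buffer whenever a cell contains doubled quotes.

```python
def count_records(data: bytes) -> int:
    """Count records by tracking quote state only; cells are never split or copied."""
    in_quotes = False
    records = 0
    pending = False  # True once bytes have been seen since the last record end
    for byte in data:
        if byte == 0x22:  # double quote: an escaped "" toggles twice, net no change
            in_quotes = not in_quotes
            pending = True
        elif byte == 0x0A and not in_quotes:  # unquoted line feed ends a record
            records += 1
            pending = False
        else:
            pending = True
    return records + (1 if pending else 0)


def unquote(cell: bytes) -> bytes:
    """Strip enclosing quotes and collapse doubled quotes into single ones."""
    if len(cell) >= 2 and cell[:1] == b'"' and cell[-1:] == b'"':
        return cell[1:-1].replace(b'""', b'"')
    return cell
```

A command that needs the actual cell values (sorting or deduplicating on a column, for instance) has to pay for something like unquote, which is why it would make a useful extra data point.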

Also, you write in your README:

non-4180-compliant data: zsv is fastest across the board (xan and polars are N/A for this input category)

xan can also handle such data, if you need it to, through the conventional non-SIMD parser exposed by the xan input command.

Best,

Labels: enhancement (New feature or request)