[](https://github.com/firelink-data/evolution/actions/workflows/ci.yml)
[](https://github.com/firelink-data/evolution/actions/workflows/tests.yml)

🦖 *Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!*

</div>

## 🔎 Overview

...

## 📦 Installation

The easiest way to install *evolution* on your system is by using the [Cargo](https://crates.io/) package manager.
```
cargo install evolution
```

Alternatively, you can build from source by cloning this repo and compiling using Cargo.
```
git clone https://github.com/firelink-data/evolution.git
cd evolution
cargo build --release
```

The program supports two different threading implementations. The default uses standard library threads and has so far proven the more reliable option; the alternative uses [rayon](https://docs.rs/rayon/latest/rayon/) for parallel iteration. To use **rayon** instead, build or install the program with the `--features rayon` flag.

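When building from source, the feature is enabled with Cargo's standard `--features` flag, for example:
```
cargo build --release --features rayon
```
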
## 🚀 Example usage

If you build or install the program as explained above, then simply running the binary will show the following:
```
🦖 Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!

Usage: evolution [OPTIONS] <COMMAND>

Commands:
  convert  Convert a fixed-length file (.flf) to parquet
  mock     Generate mocked fixed-length files (.flf) for testing purposes
  help     Print this message or the help of the given subcommand(s)

Options:
      --n-threads <NUM-THREADS>  Set the number of threads (logical cores) to use when multi-threading [default: 1]
  -h, --help                     Print help
  -V, --version                  Print version
```

The functionality of the program is structured as two main commands: **mock** and **convert**.

### 👨‍🎨 Mocking

```
Generate mocked fixed-length files (.flf) for testing purposes

Usage: evolution mock [OPTIONS] --schema <SCHEMA>

Options:
  -s, --schema <SCHEMA>
          Specify the .json schema file to mock data for
  -o, --output-file <OUTPUT-FILE>
          Specify output (target) file name
  -n, --n-rows <NUM-ROWS>
          Set the number of rows to generate [default: 100]
      --buffer-size <BUFFER-SIZE>
          Set the size of the buffer (number of rows)
      --thread-channel-capacity <THREAD-CHANNEL-CAPACITY>
          Set the capacity of the thread channel (number of messages)
  -h, --help
          Print help
```

For example, if you wanted to mock 1 billion rows of a fixed-length file from a schema located at `./my/path/to/schema.json`, with the output name `mocked-data.flf`, you could run the following command:
```
evolution mock --schema ./my/path/to/schema.json --output-file mocked-data.flf --n-rows 1000000000
```
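
Conceptually, each mocked row is just a record's fields padded out to fixed widths. A minimal Python sketch of that idea (illustrative only and independent of evolution itself; the `make_row` helper and the field widths are hypothetical, in practice the layout comes from the JSON schema):

```python
# Illustrative only: build one fixed-length row by padding each field
# to its fixed width (and truncating anything too long).
def make_row(fields, widths):
    return "".join(f.ljust(w)[:w] for f, w in zip(fields, widths))

# Hypothetical layout: id (4 chars), name (8 chars), age (3 chars).
row = make_row(["1", "alice", "42"], [4, 8, 3])
print(repr(row))  # every generated row has the same total width
```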

### 🏗️👷‍♂️ Converting

```
Convert a fixed-length file (.flf) to parquet

Usage: evolution convert [OPTIONS] --file <FILE> --schema <SCHEMA>

Options:
  -f, --file <FILE>
          The fixed-length file to convert
  -o, --output-file <OUTPUT-FILE>
          Specify output (target) file name
  -s, --schema <SCHEMA>
          Specify the .json schema file to use when converting
      --buffer-size <BUFFER-SIZE>
          Set the size of the buffer (in bytes)
      --thread-channel-capacity <THREAD-CHANNEL-CAPACITY>
          Set the capacity of the thread channel (number of messages)
  -h, --help
          Print help
```

To convert a fixed-length file called `really-big-data.flf`, with an associated schema located at `./my/path/to/schema.json`, to a parquet file named `smaller-data.parquet`, you could run the following command:
```
evolution convert --file really-big-data.flf --output-file smaller-data.parquet --schema ./my/path/to/schema.json
```
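
Conversion is essentially the inverse of mocking: each row is cut into fields at fixed byte positions before being written out as columns. A small Python sketch of the slicing step (illustrative only; the `slice_row` helper and the (offset, length) pairs are hypothetical, in practice they are derived from the JSON schema):

```python
# Illustrative only: split one fixed-length row into fields using
# (offset, length) pairs, stripping the padding afterwards.
def slice_row(row, layout):
    return [row[off:off + length].rstrip() for off, length in layout]

# Hypothetical layout matching a 15-character row: id, name, age.
fields = slice_row("1   alice   42 ", [(0, 4), (4, 8), (12, 3)])
print(fields)
```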

### 🧵 Threading

There is a global option called `--n-threads` which dictates whether the invoked command will be executed in single- or multithreaded mode. This argument should be the number of threads (logical cores) that you want to use. If you set a larger number of threads than your system has logical cores, then the program will use **all available logical cores**. If this argument is omitted, the program will run in single-threaded mode.

**Note that running multithreaded only yields a clear performance increase for substantially large workloads.**

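For example, to run the conversion from the previous section on up to 8 logical cores:
```
evolution --n-threads 8 convert --file really-big-data.flf --output-file smaller-data.parquet --schema ./my/path/to/schema.json
```
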
### 🧪 Experimental chunked slicing

An experimental multithreaded implementation also exists; it reads chunks of 2 megabytes and splits them across n cores in O(1). Run a small conversion test using the "arrow" converter with slicer type "chunked":
```
cargo run --package evolution --release --bin evolution -- convertchunked --schema resources/schema/test_schema.json --in-file resources/schema/test_schema_mock.txt --out-file out.parquet arrow chunked
```


## 📋 License

All code is held under a general MIT license; please see [LICENSE](https://github.com/firelink-data/evolution/blob/main/LICENSE) for specific information.