Commit e442521

1 parent ecd9ca7 commit e442521

1 file changed: README.md (101 additions & 9 deletions)
1010
[![CI](https://github.com/firelink-data/evolution/actions/workflows/ci.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/ci.yml)
1111
[![Tests](https://github.com/firelink-data/evolution/actions/workflows/tests.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/tests.yml)
1212

13-
🦖 *Evolve your fixed length data files into Apache Arrow tables, fully parallelized!*
14-
13+
🦖 *Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!*
1514

1615
</div>

## 🔎 Overview

...

## 📦 Installation

The easiest way to install *evolution* on your system is by using the [Cargo](https://crates.io/) package manager.
```
cargo install evolution
```

Alternatively, you can build from source by cloning this repo and compiling using Cargo.
```
git clone https://github.com/firelink-data/evolution.git
cd evolution
cargo build --release
```

The program uses one of two threading implementations. The default uses standard library threads and has so far proven the more reliable option; the alternative uses [rayon](https://docs.rs/rayon/latest/rayon/) for parallel iteration. To use **rayon** instead, build or install the program with the `--features rayon` flag.

## 🚀 Example usage

If you build or install the program as explained above, simply running the binary will print the following:
```
🦖 Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!

Usage: evolution [OPTIONS] <COMMAND>

Commands:
  convert  Convert a fixed-length file (.flf) to parquet
  mock     Generate mocked fixed-length files (.flf) for testing purposes
  help     Print this message or the help of the given subcommand(s)

Options:
      --n-threads <NUM-THREADS>  Set the number of threads (logical cores) to use when multi-threading [default: 1]
  -h, --help                     Print help
  -V, --version                  Print version
```

The functionality of the program is structured around two main commands: **mock** and **convert**.

### 👨‍🎨 Mocking

```
Generate mocked fixed-length files (.flf) for testing purposes

Usage: evolution mock [OPTIONS] --schema <SCHEMA>

Options:
  -s, --schema <SCHEMA>
          Specify the .json schema file to mock data for
  -o, --output-file <OUTPUT-FILE>
          Specify output (target) file name
  -n, --n-rows <NUM-ROWS>
          Set the number of rows to generate [default: 100]
      --buffer-size <BUFFER-SIZE>
          Set the size of the buffer (number of rows)
      --thread-channel-capacity <THREAD-CHANNEL-CAPACITY>
          Set the capacity of the thread channel (number of messages)
  -h, --help
          Print help
```

For example, to mock 1 billion rows of a fixed-length file from a schema located at `./my/path/to/schema.json` with the output name `mocked-data.flf`, you could run the following command:
```
evolution mock --schema ./my/path/to/schema.json --output-file mocked-data.flf --n-rows 1000000000
```
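
A fixed-length record is just a sequence of fields, each written into a fixed byte width. As a loose illustration of what a mocked record looks like (the column names and widths here are made up; evolution's actual layout comes from the `.json` schema file, which is not shown here):

```rust
// Hypothetical illustration: build one fixed-length record by left-aligning
// each field into a fixed byte width. The widths (10, 8, 3) are invented for
// this example and are NOT evolution's schema format.
fn mock_record(name: &str, city: &str, age: u32) -> String {
    format!("{:<10}{:<8}{:<3}", name, city, age)
}

fn main() {
    let record = mock_record("Alice", "Malmo", 42);
    // Every record in the file has the same total width: 10 + 8 + 3 = 21 bytes.
    assert_eq!(record.len(), 21);
    println!("{record:?}");
}
```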

### 🏗️👷‍♂️ Converting

```
Convert a fixed-length file (.flf) to parquet

Usage: evolution convert [OPTIONS] --file <FILE> --schema <SCHEMA>

Options:
  -f, --file <FILE>
          The fixed-length file to convert
  -o, --output-file <OUTPUT-FILE>
          Specify output (target) file name
  -s, --schema <SCHEMA>
          Specify the .json schema file to use when converting
      --buffer-size <BUFFER-SIZE>
          Set the size of the buffer (in bytes)
      --thread-channel-capacity <THREAD-CHANNEL-CAPACITY>
          Set the capacity of the thread channel (number of messages)
  -h, --help
          Print help
```

To convert a fixed-length file called `really-big-data.flf`, with an associated schema located at `./my/path/to/schema.json`, to a parquet file named `smaller-data.parquet`, you could run the following command:
```
evolution convert --file really-big-data.flf --output-file smaller-data.parquet --schema ./my/path/to/schema.json
```
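
Conversion is essentially the inverse of mocking: each record is sliced into fields at byte offsets that the schema defines, and the fields become column values. A rough sketch of the per-record slicing, assuming ASCII data and the same invented widths as above (this is an illustration, not evolution's actual converter):

```rust
// Hypothetical illustration: slice one fixed-length record into its fields
// using per-column widths a schema would define. Assumes ASCII data, since
// byte-offset slicing of a &str panics on non-UTF-8 char boundaries.
fn slice_fields<'a>(record: &'a str, widths: &[usize]) -> Vec<&'a str> {
    let mut fields = Vec::with_capacity(widths.len());
    let mut start = 0;
    for &w in widths {
        // Trailing padding spaces are stripped from each fixed-width field.
        fields.push(record[start..start + w].trim_end());
        start += w;
    }
    fields
}

fn main() {
    let record = "Alice     Malmo   42 ";
    assert_eq!(slice_fields(record, &[10, 8, 3]), ["Alice", "Malmo", "42"]);
}
```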

### 🧵 Threading

There is a global option called `--n-threads` which dictates whether the invoked command executes in single- or multithreaded mode. Its argument is the number of threads (logical cores) you want to use. If you set a larger number of threads than your system has logical cores, the program will use **all available logical cores**. If the argument is omitted, the program runs in single-threaded mode.

**Note that running multithreaded only yields a clear performance increase for substantially large workloads.**
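
The clamping rule described above can be sketched with the standard library like this (a minimal illustration, not evolution's actual code):

```rust
use std::thread;

// Minimal sketch of the --n-threads clamping rule: requests above the number
// of available logical cores fall back to all available cores, and anything
// below 1 is treated as single-threaded.
fn effective_threads(requested: usize) -> usize {
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    requested.clamp(1, available)
}

fn main() {
    println!("1 requested  -> {} used", effective_threads(1));
    println!("huge request -> {} used", effective_threads(usize::MAX));
}
```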
125+
126+
### 🧵 Threading
127+
An experimental multithreaded implementation exists , it reads chunks of 2 megabytes and splits them into n anmounts of cores in O(1).
128+
Run a small conversion test using the "arrow" converter with slicer type "chunked"
37129
```
38-
$ cargo run --package evolution --release --bin evolution -- convert --schema resources/schema/test_schema.json --in-file resources/schema/test_schema_mock.txt --out-file out.parquet arrow old
130+
$ cargo run --package evolution --release --bin evolution -- convertchunked --schema resources/schema/test_schema.json --in-file resources/schema/test_schema_mock.txt --out-file out.parquet arrow chunked
39131
```
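
The O(1) split is possible because every record in a fixed-length file has the same known length, so worker boundaries are pure arithmetic rather than a scan for delimiters. A rough sketch of that idea (hypothetical helper, not the actual slicer):

```rust
// Rough sketch of splitting a chunk of fixed-length records across n workers
// with O(1) arithmetic per boundary (hypothetical helper, not evolution's
// actual slicer). Boundaries land exactly on record edges, so no worker ever
// receives a partial record.
fn split_points(chunk_len: usize, record_len: usize, n: usize) -> Vec<(usize, usize)> {
    let records = chunk_len / record_len;
    let (per, rem) = (records / n, records % n);
    let mut ranges = Vec::with_capacity(n);
    let mut start = 0;
    for i in 0..n {
        // The first `rem` workers take one extra record each.
        let count = per + usize::from(i < rem);
        let end = start + count * record_len;
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    // 100 bytes of 10-byte records split across 3 workers: 4 + 3 + 3 records.
    assert_eq!(split_points(100, 10, 3), [(0, 40), (40, 70), (70, 100)]);
}
```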

## 📋 License

All code is held under a general MIT license; please see [LICENSE](https://github.com/firelink-data/evolution/blob/main/LICENSE) for specific information.