Commit e360e57

[docs] update README and add CD workflow (#36)

* [docs] evolution banner
* [docs] update README and fix cli help messages
* [build] CD workflow to crates.io

1 parent d98e589 · commit e360e57

4 files changed: 101 additions & 62 deletions

.github/workflows/cd.yml (30 additions, 0 deletions)
New file (@@ -0,0 +1,30 @@):

```yaml
name: CD

on:
  release:
    types: [ published ]

jobs:
  deploy:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ ubuntu-latest ]
        rust: [ stable ]
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Install toolchain
        uses: actions-rs/toolchain@v1
        with:
          profile: minimal
          toolchain: ${{ matrix.rust }}
          override: true
      - name: Cargo check
        uses: actions-rs/cargo@v1
        with:
          command: check
      - name: Cargo publish
        run: cargo publish --token ${CRATES_TOKEN}
        env:
          CRATES_TOKEN: ${{ secrets.CRATES_TOKEN }}
```
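One hedged addition (not part of this commit): a `cargo publish --dry-run` step before the real publish would catch packaging problems, such as a missing manifest field, without pushing anything to crates.io. The step name and placement are assumptions:

```yaml
      # Hypothetical extra step (not in this commit): verify that the crate
      # packages and passes crates.io checks without actually publishing.
      - name: Cargo publish (dry run)
        run: cargo publish --dry-run
```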

README.md (70 additions, 61 deletions)
````diff
@@ -1,70 +1,80 @@
 <div align="center">
 <br/>
+<br/>
 <div align="left">
 <br/>
+<p align="center">
+<a href="https://github.com/firelink-data/evolution">
+<img align="center" width=50% src="./resources/images/evolution-banner.png"></img>
+</a>
+</p>
 </div>
+<br/>
+<br/>
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![Crates.io (latest)](https://img.shields.io/crates/v/evolution)](https://crates.io/crates/evolution)
 [![codecov](https://codecov.io/gh/firelink-data/evolution/graph/badge.svg?token=B95DUS13B5)](https://codecov.io/gh/firelink-data/evolution)
 [![CI](https://github.com/firelink-data/evolution/actions/workflows/ci.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/ci.yml)
+[![CD](https://github.com/firelink-data/evolution/actions/workflows/cd.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/cd.yml)
 [![Tests](https://github.com/firelink-data/evolution/actions/workflows/tests.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/tests.yml)
 
-🦖 *Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!*
 
 </div>
 
 
 ## 🔎 Overview
 
-**Take your old and highly inefficient fixed-length files and evolve them into something more efficient, like Apache Parquet!**
+Take your **old and inefficient fixed-length files** and **evolve them into a modern data format like Apache Parquet!**
 
-This repository hosts the **evolution** program which both allows you to convert existing fixed-length files into other data formats,
-but also allows you to create large amounts of mocked data blazingly fast. The program supports full parallelism and utilizes SIMD
-techniques, when possible, for highly efficient parsing of data. To get started, follow the installation, setup, and example usage
-sections below in this README.
+This repository hosts the **evolution** program, which allows you both to convert existing fixed-length files into other data formats and to create large amounts of mocked data blazingly fast. The program supports full parallelism and utilizes SIMD techniques, when possible, for highly efficient parsing of data.
 
-Happy hacking! 👋🥳
+To get started, follow the installation, schema setup, and example usage sections below in this README. Happy hacking! 👋🥳
 
 
 ## 📋 Table of contents
-https://github.com/firelink-data/evolution/edit/feat/single-threaded/README.md
-* [Installation](https://github.com/firelink-data/evolution#-installation)
-* [Schema setup](https://github.com/firelink-data/evolution#-schema-setup)
-* [Example usage](https://github.com/firelink-data/evolution#-example-usage)
-* [Converting]()
-* [Mocking]()
-* [Threading]()
-* [License](https://github.com/firelink-data/evolution#-license)
+
+All of the code in this repo is open source and licensed according to [LICENSE](https://github.com/firelink-data/evolution/blob/main/LICENSE); refer to [this](https://github.com/firelink-data/evolution?tab=readme-ov-file#-license) link for more information.
+
+* [Installation](https://github.com/firelink-data/evolution?tab=readme-ov-file#-installation)
+* [Schema setup](https://github.com/firelink-data/evolution?tab=readme-ov-file#-schema-setup)
+* [Example usage](https://github.com/firelink-data/evolution?tab=readme-ov-file#-example-usage)
+* [Converting](https://github.com/firelink-data/evolution?tab=readme-ov-file#%EF%B8%8F%EF%B8%8F-converting)
+* [Mocking](https://github.com/firelink-data/evolution?tab=readme-ov-file#-mocking)
+* [Threading](https://github.com/firelink-data/evolution?tab=readme-ov-file#-threading)
 
 
 ## 📦 Installation
 
-The easiest way to install *evolution* on your system is by using the [Cargo](https://crates.io/) package manager.
+The easiest way to install an **evolution** binary on your system is by using the [Cargo](https://crates.io/) package manager (which downloads it from [this](https://crates.io/crates/evolution) link).
 ```
-cargo install evolution
+$ cargo install evolution
+
+(available features)
+- rayon
+- nightly
 ```
 
-Alternatively, you can build from source by cloning this repo and compiling using Cargo.
+Alternatively, you can build from source by cloning the repo and compiling using Cargo. See below for available optional features.
 ```
-git clone https://github.com/firelink-data/evolution.git
-cd evolution
-cargo build --release
+$ git clone https://github.com/firelink-data/evolution.git
+$ cd evolution
+$ cargo build --release
+
+(optional: copy the binary to your user's binary folder)
+$ cp ./target/release/evolution /usr/bin/evolution
 ```
 
-The program uses either of two different types of threading implementations. The default implementation uses the
-standard library threads and has so far proven a more reliable version; the alternative is using [rayon](https://docs.rs/rayon/latest/rayon/)
-for parallel iteration. To use **rayon** instead, build or install the program with the `--features rayon` flag.
+- Installing with the **rayon** feature will utilize the [rayon](https://docs.rs/rayon/latest/rayon/) crate for parallel execution instead of the standard library threads. It also enables converting in **chunked** mode. Please see [this](https://github.com/firelink-data/evolution?tab=readme-ov-file#%EF%B8%8F%EF%B8%8F-converting) reference for more information.
+- Installing with the **nightly** feature will use the [nightly](https://doc.rust-lang.org/book/appendix-07-nightly-rust.html) toolchain, which is by nature unstable. To run this version you need the nightly toolchain installed on your system; you can install it by running `rustup install nightly` from your shell.
 
 
 ## 📝 Schema setup
 
-All available commands in *evolution* require an existing valid **schema**. A schema, in this context, is a [json](https://www.json.org/json-en.html)
-file specifying the layout of the contents of a fixed-length file. Every schema used has to follow
-[this](https://github.com/firelink-data/evolution/tree/main/resources/template-schema.json) template. If you are unsure whether or not your own schema
+All available commands in **evolution** require an existing valid **schema**. A schema, in this context, is a [json](https://www.json.org/json-en.html) file specifying the layout of the contents of a fixed-length file (flf). Every schema used has to adhere to [this](https://github.com/firelink-data/evolution/tree/main/resources/template-schema.json) template. If you are unsure whether or not your own schema
 file is valid according to the template, you can use [this](https://www.jsonschemavalidator.net/) validator tool.
 
-An example schema can be found [here](https://github.com/firelink-data/evolution/tree/main/resources/example-schema.json), and looks like this:
+An example schema can be found [here](https://github.com/firelink-data/evolution/blob/main/resources/example_schema.json), and looks like this:
 ```
 {
     "name": "EvolutionExampleSchema",
@@ -74,23 +84,23 @@ An example schema can be found [here](https://github.com/firelink-data/evolution
             "name": "id",
             "offset": 0,
             "length": 9,
-            "dtype": "i32",
+            "dtype": "Int32",
             "alignment": "Right",
-            "pad_symbol": "Zero",
+            "pad_symbol": "Underscore",
             "is_nullable": false
         },
         {
             "name": "name",
             "offset": 9,
             "length": 32,
-            "dtype": "utf8",
+            "dtype": "Utf8",
             "is_nullable": true
         },
         {
             "name": "city",
             "offset": 41,
             "length": 32,
-            "dtype": "utf8",
+            "dtype": "Utf8",
             "alignment": "Right",
             "pad_symbol": "Backslash",
             "is_nullable": false
@@ -99,24 +109,24 @@ An example schema can be found [here](https://github.com/firelink-data/evolution
             "name": "employed",
             "offset": 73,
             "length": 5,
-            "dtype": "boolean",
+            "dtype": "Boolean",
             "alignment": "Center",
             "pad_symbol": "Asterisk",
-            "is_nullable": false
+            "is_nullable": true
         }
     ]
 }
 ```
````
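The `offset` and `length` fields in the schema above map one-to-one onto byte ranges of a fixed-length record. A minimal sketch of that mapping (illustration only, not evolution's actual parser; the helper name and record value are made up, while the column widths and pad symbols follow the example schema):

```rust
// Sketch only: shows how a schema's (offset, length) pairs select byte
// ranges from one fixed-length record. Column widths follow the example
// schema above (id: 9, name: 32, city: 32, employed: 5); the helper and
// record value are hypothetical, not evolution's actual parser.
fn slice_column(record: &str, offset: usize, length: usize) -> &str {
    &record[offset..offset + length]
}

fn main() {
    // Build one 78-byte record with the pad symbols from the schema.
    let record = format!(
        "{:_>9}{:<32}{:\\>32}{:*^5}",
        42, "Jane Doe", "Stockholm", "true"
    );
    assert_eq!(record.len(), 78);

    let id = slice_column(&record, 0, 9).trim_matches('_');
    let name = slice_column(&record, 9, 32).trim();
    let employed = slice_column(&record, 73, 5).trim_matches('*');
    println!("id={id} name={name} employed={employed}");
}
```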
````diff
 
-As specified in the template, all columns have to provide the following fields **(name, offset, length, is_nullable)**, whereas
-**alignment** and **pad_symbol** can be omitted (as they are in this example for the *name* column). If they are not provided, they will assume their default values which are
-"**Right**" and "**Whitespace**" respectively. These default values come from the [padder](https://github.com/firelink-data/padder) crate which defines the enums
+- If you are unsure about valid values for the **dtype**, **alignment**, and **pad_symbol** fields, please refer to the template, which lists all valid values.
+- All columns have to provide the fields **name**, **offset**, **length**, and **is_nullable**, whereas **alignment** and **pad_symbol** can be omitted (as they are in this example for the *name* column). If they are not provided, they will assume their default values, which are "**Right**" and "**Whitespace**".
+- The default values come from the [padder](https://github.com/firelink-data/padder) crate which defines the enums
 `Alignment` and `Symbol`, with default implementations as `Alignment::Right` and `Symbol::Whitespace` respectively.
````
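The defaults mentioned above can be pictured as ordinary Rust `Default` implementations. A sketch under stated assumptions: only `Alignment::Right` and `Symbol::Whitespace` being the defaults comes from the README; the derive style and the variant lists are illustrative guesses, not padder's actual source:

```rust
// Sketch of how the padder crate's documented defaults could be declared.
// Only Alignment::Right and Symbol::Whitespace being the defaults comes
// from the README; the derive style and variant lists are assumptions.
#[derive(Debug, PartialEq, Default)]
enum Alignment {
    Left,
    #[default]
    Right,
    Center,
}

#[derive(Debug, PartialEq, Default)]
enum Symbol {
    #[default]
    Whitespace,
    Underscore,
    Asterisk,
    Backslash,
    Zero,
}

fn main() {
    // A schema column that omits alignment/pad_symbol falls back to these.
    assert_eq!(Alignment::default(), Alignment::Right);
    assert_eq!(Symbol::default(), Symbol::Whitespace);
    println!("defaults: {:?} / {:?}", Alignment::default(), Symbol::default());
}
```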
````diff
 
 
 ## 🚀 Example usage
 
-If you build and/or install the program as explained above then by simply running the binary you will see the following:
+If you install the program as explained above, then by simply running the binary you will see the following helpful usage print:
 ```
 🦖 Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!
 
@@ -133,20 +143,22 @@ Options:
   -V, --version  Print version
 ```
 
-As you can see from above, the functionality of the program comprises of the two main commands: **convert** and **mock**.
+As you can see from above, the functionality of the program comprises the two main commands: **convert** and **mock**. If you installed the program with the **rayon** feature you will also have access to a third command called **c-convert**. This stands for **chunked-convert** and is an alternative implementation. Documentation for this command is work-in-progress.
+
+- If you want to see debug prints during execution, set the `RUST_LOG` environment variable to `DEBUG` before executing the program.
 
 
 ### 🏗️👷‍♂️ Converting
 
 ```
 Convert a fixed-length file (.flf) to parquet
 
-Usage: evolution convert [OPTIONS] --file <FILE> --schema <SCHEMA>
+Usage: evolution convert [OPTIONS] --in-file <IN-FILE> --out-file <OUT-FILE> --schema <SCHEMA>
 
 Options:
-  -f, --file <FILE>
+  -i, --in-file <IN-FILE>
          The fixed-length file to convert
-  -o, --output-file <OUTPUT-FILE>
+  -o, --out-file <OUT-FILE>
          Specify output (target) file name
   -s, --schema <SCHEMA>
          Specify the .json schema file to use when converting
@@ -158,9 +170,9 @@ Options:
          Print help
 ```
 
-To convert a fixed-length file called `really-big-data.flf`, with associated schema located at `./my/path/to/schema.json`, to a parquet file with name `smaller-data.parquet`, you could run the following command:
+To convert a fixed-length file called `old-data.flf`, with associated schema located at `./my/path/to/schema.json`, to a parquet file with name `converted.parquet`, you could run the following command:
 ```
-evolution convert --file really-big-data.flf --output-file smaller-data.parquet --schema ./my/path/to/schema.json
+$ evolution convert --in-file old-data.flf --out-file converted.parquet --schema ./my/path/to/schema.json
 ```
 
 
@@ -174,51 +186,48 @@ Usage: evolution mock [OPTIONS] --schema <SCHEMA>
 Options:
   -s, --schema <SCHEMA>
          Specify the .json schema file to mock data for
-  -o, --output-file <OUTPUT-FILE>
+  -o, --out-file <OUT-FILE>
          Specify output (target) file name
   -n, --n-rows <NUM-ROWS>
          Set the number of rows to generate [default: 100]
-      --buffer-size <BUFFER-SIZE>
+      --force-new
+         Set the writer option to fail if the file already exists
+      --truncate-existing
+         Set the writer option to truncate a previous file if the out file already exists
+      --buffer-size <MOCKER-BUFFER-SIZE>
          Set the size of the buffer (number of rows)
-      --thread-channel-capacity <THREAD-CHANNEL-CAPACITY>
+      --thread-channel-capacity <MOCKER-THREAD-CHANNEL-CAPACITY>
          Set the capacity of the thread channel (number of messages)
   -h, --help
          Print help
 ```
 
-For example, if you wanted to mock 1 billion rows of a fixed-length file from a schema located at `./my/path/to/schema.json` with
-the output name `mocked-data.flf`, you could run the following command:
+For example, if you wanted to mock 1 billion rows of a fixed-length file from a schema located at `./my/path/to/schema.json` with the output name `mocked-data.flf` and enforce that the file should not already exist, you could run the following command:
 ```
-evolution mock --schema ./my/schema/path/schema.json --output-file mocked-data.flf --n-rows 1000000000
+$ evolution mock --schema ./my/path/to/schema.json --out-file mocked-data.flf --n-rows 1000000000 --force-new
 ```
````
195210

196211
### 🧵 Threading
197212

198-
There exists a global setting for the program called `--n-threads` which dictates whether or not the invoked command will be executed
199-
in single- or multithreaded mode. This argument should be a number representing the number of threads (logical cores) that you want
200-
to use. If you try and set a larger number of threads than you system has logical cores, then the program will use **all available
201-
logical cores**. If this argument is omitted, then the program will run in single-threaded mode.
213+
There exists a global setting for the program called `--n-threads` which dictates whether or not the invoked command will be executed in single- or multithreaded mode. This argument should be a number representing the number of threads (logical cores) that you want to use. If you try and set a larger number of threads than you system has logical cores, then the program will use **all available logical cores**. If this argument is omitted, then the program will run in single-threaded mode.
202214

203-
**Note that running multithreaded only really has any clear increase in performance for substantially large workloads.**
215+
**Note that running in multithreaded mode only really has any clear increase in performance for substantially large workloads.**
204216

205-
If you are unsure how many logical cores your CPU has, the easiest way to find out is by simply running the program with the
206-
`--n-threads` option set to a large number. The program will check how many logical cores you have and see whether
207-
this option exceeds the possible value. If the value you passed is greater than the number of logical cores on your system, then
208-
the number of logical cores available will be logged to you on stdout.
217+
If you are unsure how many logical cores your CPU has, the easiest way to find out is by simply running the program with the `--n-threads` option set to a large number. The program will check how many logical cores you have and see whether this option exceeds the possible value. If the value you passed is greater than the number of logical cores on your system, then the number of logical cores available will be logged to you on stdout.
209218

210219
You could also potentially use one of the commands below depending on your host system.
211220

212221
### Windows
213222
```
214-
Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors
223+
$ Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors
215224
```
216225

217226
Use the value found under **NumberOfLogicalProcessors**.
218227

219228
### Unix
220229
```
221-
lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
230+
$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
222231
```
223232

224233
The number of logical cores is calculed as: **threads per core X cores per socket X sockets**.
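Programmatically, Rust's standard library reports the same number via `std::thread::available_parallelism`. A small sketch of how an `--n-threads`-style cap could be clamped against it (illustrative only; whether evolution uses this exact API is an assumption):

```rust
use std::thread;

// Minimal sketch: ask the standard library for the logical core count and
// clamp a user-requested thread count against it, mirroring the behavior
// the README describes for --n-threads. Whether evolution uses this exact
// API is an assumption.
fn logical_cores() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1) // fall back to single-threaded if undetectable
}

fn main() {
    let cores = logical_cores();
    let requested: usize = 10_000; // deliberately too large
    let n_threads = requested.min(cores);
    println!("logical cores: {cores}, using {n_threads} threads");
}
```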

src/cli.rs (1 addition, 1 deletion)
```diff
@@ -194,7 +194,7 @@ enum Commands {
         )]
         n_rows: Option<usize>,
 
-        /// Set the writer mode to create a new file or fail if it already exists.
+        /// Set the writer option to fail if the file already exists.
         #[arg(
             long = "force-new",
             value_name = "WRITER-FORCE-NEW",
```
