Commit e970b66

docs: rewrite readme remove bloat

1 parent 9c5b05f commit e970b66

1 file changed: README.md (53 additions, 177 deletions)

@@ -1,218 +1,94 @@
-# Evolution
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Crates.io (latest)](https://img.shields.io/crates/v/evolution)](https://crates.io/crates/evolution)
-[![codecov](https://codecov.io/gh/firelink-data/evolution/graph/badge.svg?token=B95DUS13B5)](https://codecov.io/gh/firelink-data/evolution)
-[![CI](https://github.com/firelink-data/evolution/actions/workflows/ci.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/ci.yml)
-[![CD](https://github.com/firelink-data/evolution/actions/workflows/cd.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/cd.yml)
-[![Tests](https://github.com/firelink-data/evolution/actions/workflows/tests.yml/badge.svg)](https://github.com/firelink-data/evolution/actions/workflows/tests.yml)
+# evolution

-> 🦖 A modern, platform-agnostic, and highly efficient framework that makes it easy to convert and evolve fixed-length files into robust and future-proof targets such as, but not limited to, **Parquet**, **Delta**, or **Iceberg**.
+> A robust, platform-agnostic, and highly efficient framework for converting old fixed-length files to future-proof targets suitable for analytics and data science.

-This repository hosts the **evolution** project, which lets you both convert existing fixed-length files into other data formats and create large amounts of mocked data blazingly fast. The program supports full parallelism and utilizes SIMD techniques, when possible, for highly efficient parsing of data.
+The evolution project was created in response to the emerging need for a tool that can transform old fixed-length files into data formats that integrate seamlessly with the modern data analytics landscape, and that can do so fully automatically.

-To get started, follow the installation, schema setup, and example usage sections below in this README.
+We utilize the native speed of Rust together with multithreading and SIMD techniques to efficiently transform your old fixed-length files (of any size!) to a more modern target. The only target currently implemented is **parquet**, but we aim to implement support for **delta**, **iceberg**, **indradb**, and more.

+The project is structured as a monorepo hosting all of the *evolution* framework components, found under [crates/](crates/) as their own modules. This modular design allows anyone to implement their own target converters that integrate seamlessly with the core framework's existing functionality.

-## 📋 Table of contents

-All of the code in this repo is open source and licensed according to [LICENSE](https://github.com/firelink-data/evolution/blob/main/LICENSE); refer to [this](https://github.com/firelink-data/evolution?tab=readme-ov-file#-license) link for more information.
+## Installation

-* [Installation](https://github.com/firelink-data/evolution?tab=readme-ov-file#-installation)
-* [Schema setup](https://github.com/firelink-data/evolution?tab=readme-ov-file#-schema-setup)
-* [Example usage](https://github.com/firelink-data/evolution?tab=readme-ov-file#-example-usage)
-* [Converting](https://github.com/firelink-data/evolution?tab=readme-ov-file#%EF%B8%8F%EF%B8%8F-converting)
-* [Mocking](https://github.com/firelink-data/evolution?tab=readme-ov-file#-mocking)
-* [Threading](https://github.com/firelink-data/evolution?tab=readme-ov-file#-threading)
+The easiest way to install a complete *evolution* binary on your system is with the [Cargo](https://crates.io/) package manager (which downloads it from [this](https://crates.io/crates/evolution) link).

-
-## 📦 Installation
-
-The easiest way to install an **evolution** binary on your system is with the [Cargo](https://crates.io/) package manager (which downloads it from [this](https://crates.io/crates/evolution) link).
-```
-$ cargo install evolution
-
-(available features)
-- rayon
-- nightly
-```
-
-Alternatively, you can build from source by cloning the repo and compiling with Cargo. See below for available optional features.
 ```
-$ git clone https://github.com/firelink-data/evolution.git
-$ cd evolution
-$ cargo build --release
+cargo install evolution

-(optional: copy the binary to your user's binary folder)
-$ cp ./target/release/evolution /usr/bin/evolution
+(available features)
+- mock
+- nightly
 ```

-- Installing with the **rayon** feature will utilize the [rayon](https://docs.rs/rayon/latest/rayon/) crate for parallel execution instead of the standard library threads. It also enables converting in **chunked** mode. Please see [this](https://github.com/firelink-data/evolution?tab=readme-ov-file#%EF%B8%8F%EF%B8%8F-converting) reference for more information.
-- Installing with the **nightly** feature will use the [nightly](https://doc.rust-lang.org/book/appendix-07-nightly-rust.html) toolchain, which is by nature unstable. To run this version you need the nightly toolchain installed on your system; you can install it by running `rustup install nightly` from your shell.
-
-## 📝 Schema setup
+Alternatively, you can build everything from source by cloning the repo and compiling with Cargo.

-All available commands in **evolution** require an existing valid **schema**. A schema, in this context, is a [json](https://www.json.org/json-en.html) file specifying the layout of the contents of a fixed-length file (flf). Every schema used has to adhere to [this](https://github.com/firelink-data/evolution/tree/main/resources/template-schema.json) template. If you are unsure whether or not your own schema file is valid according to the template, you can use [this](https://www.jsonschemavalidator.net/) validator tool.
-
-An example schema can be found [here](https://github.com/firelink-data/evolution/blob/main/resources/example_schema.json), and looks like this:
-```
-{
-    "name": "EvolutionExampleSchema",
-    "version": 1337,
-    "columns": [
-        {
-            "name": "id",
-            "offset": 0,
-            "length": 9,
-            "dtype": "Int32",
-            "alignment": "Right",
-            "pad_symbol": "Underscore",
-            "is_nullable": false
-        },
-        {
-            "name": "name",
-            "offset": 9,
-            "length": 32,
-            "dtype": "Utf8",
-            "is_nullable": true
-        },
-        {
-            "name": "city",
-            "offset": 41,
-            "length": 32,
-            "dtype": "Utf8",
-            "alignment": "Right",
-            "pad_symbol": "Backslash",
-            "is_nullable": false
-        },
-        {
-            "name": "employed",
-            "offset": 73,
-            "length": 5,
-            "dtype": "Boolean",
-            "alignment": "Center",
-            "pad_symbol": "Asterisk",
-            "is_nullable": true
-        }
-    ]
-}
 ```
-
-- If you are unsure about valid values for the **dtype**, **alignment**, and **pad_symbol** fields, please refer to the template, which lists all valid values.
-- All columns have to provide the fields **name**, **offset**, **length**, and **is_nullable**, whereas **alignment** and **pad_symbol** can be omitted (as they are in this example for the *name* column). If they are not provided, they assume their default values, which are "**Right**" and "**Whitespace**".
-- The default values come from the [padder](https://github.com/firelink-data/padder) crate, which defines the enums `Alignment` and `Symbol`, with default implementations `Alignment::Right` and `Symbol::Whitespace` respectively.
-
-
-## ⚡️ Quick start
-
-If you install the program as explained above, then by simply running the binary you will see the following helpful usage print:
+git clone https://github.com/firelink-data/evolution.git
+cd evolution
+cargo build --release
 ```
-🦖 Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!

-Usage: evolution [OPTIONS] <COMMAND>
+If you want to integrate any of the evolution crates in your own project, simply add them as dependencies to your project's Cargo.toml file, for example:

-Commands:
-  convert  Convert a fixed-length file (.flf) to parquet
-  mock     Generate mocked fixed-length files (.flf) for testing purposes
-  help     Print this message or the help of the given subcommand(s)
-
-Options:
-      --n-threads <NUM-THREADS>  Set the number of threads (logical cores) to use when multi-threading [default: 1]
-  -h, --help                     Print help
-  -V, --version                  Print version
+```toml
+[dependencies]
+evolution-common = "1.0.0"
+evolution-schema = "1.0.0"
+...
 ```

-As you can see from above, the functionality of the program comprises the two main commands **convert** and **mock**. If you installed the program with the **rayon** feature you will also have access to a third command called **c-convert**. This stands for **chunked-convert** and is an alternative implementation. Documentation for this command is work-in-progress.

-- If you want to see debug prints during execution, set the `RUST_LOG` environment variable to `DEBUG` before executing the program.
+## Schema setup

+To be able to work with automatic file conversion you need a valid **schema** available which specifies the structure of the source file you want to convert. A valid schema, in this context, is a json file which adheres to [this template](https://github.com/firelink-data/evolution/tree/main/resources/template-schema.json). If you are unsure whether or not your own schema file is valid according to the template, you can use [this](https://www.jsonschemavalidator.net/) validator tool.

-### 🏗️👷‍♂️ Converting
+An example schema can be found [here](https://github.com/firelink-data/evolution/blob/main/examples/full/res/example_schema.json), and if you are unsure about valid values for datatypes, alignment modes, and padding symbols, please refer to the [template](https://github.com/firelink-data/evolution/blob/main/examples/full/res/template_schema.json), which lists all valid values. For specifics on all the currently supported padding modes, characters, and default values, please see the [padder](https://github.com/firelink-data/padder) crate (which we also maintain).
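The schema's `offset` and `length` fields map directly onto slicing a fixed-length record, with `alignment` and `pad_symbol` deciding which padding to strip. A minimal Rust sketch of that idea (hypothetical illustration only, not code from the evolution crates; the values mirror the example schema's `id` and `name` columns):

```rust
// Hypothetical sketch of what one schema column describes; this is NOT
// code from the evolution crates. Field names mirror the example
// schema's "offset", "length", and "pad_symbol" entries.
struct Column {
    offset: usize,
    length: usize,
    pad_symbol: char, // e.g. '_' for "Underscore", ' ' for "Whitespace"
}

/// Slice one column out of a fixed-length record and strip its padding.
fn extract(record: &str, col: &Column) -> String {
    record[col.offset..col.offset + col.length]
        .trim_matches(col.pad_symbol)
        .to_string()
}

fn main() {
    // Build a 41-byte record: "id" right-aligned to 9 chars with '_',
    // "name" right-aligned to 32 chars with whitespace.
    let record = format!("{:_>9}{:>32}", 123, "Jane Doe");
    assert_eq!(extract(&record, &Column { offset: 0, length: 9, pad_symbol: '_' }), "123");
    assert_eq!(extract(&record, &Column { offset: 9, length: 32, pad_symbol: ' ' }), "Jane Doe");
    println!("parsed ok");
}
```

Because every column carries its own offset and width, records can be cut apart without any delimiter scanning, which is what makes fixed-length parsing so amenable to parallel and SIMD processing.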

-```
-Convert a fixed-length file (.flf) to parquet

-Usage: evolution convert [OPTIONS] --in-file <IN-FILE> --out-file <OUT-FILE> --schema <SCHEMA>
+## Quick start

-Options:
-  -i, --in-file <IN-FILE>
-          The fixed-length file to convert
-  -o, --out-file <OUT-FILE>
-          Specify output (target) file name
-  -s, --schema <SCHEMA>
-          Specify the .json schema file to use when converting
-      --buffer-size <BUFFER-SIZE>
-          Set the size of the buffer (in bytes)
-      --thread-channel-capacity <THREAD-CHANNEL-CAPACITY>
-          Set the capacity of the thread channel (number of messages)
-  -h, --help
-          Print help
-```
+If you install the program as explained above, then by simply running the binary you will see the following helpful usage print:

-To convert a fixed-length file called `old-data.flf`, with associated schema located at `./my/path/to/schema.json`, to a parquet file named `converted.parquet`, you could run the following command:
-```
-$ evolution convert --in-file old-data.flf --out-file converted.parquet --schema ./my/path/to/schema.json
 ```
+🦖 Evolve your fixed-length data files into Apache Parquet, fully parallelized!

+Usage: evolution.exe [OPTIONS] <COMMAND>

-### 👨‍🎨 Mocking
-
-```
-Generate mocked fixed-length files (.flf) for testing purposes
-
-Usage: evolution mock [OPTIONS] --schema <SCHEMA>
+Commands:
+  convert  Convert a fixed-length file to another file format
+  mock     Generate mocked fixed-length files
+  help     Print this message or the help of the given subcommand(s)

 Options:
-  -s, --schema <SCHEMA>
-          Specify the .json schema file to mock data for
-  -o, --out-file <OUT-FILE>
-          Specify output (target) file name
-  -n, --n-rows <NUM-ROWS>
-          Set the number of rows to generate [default: 100]
-      --force-new
-          Set the writer option to fail if the file already exists
-      --truncate-existing
-          Set the writer option to truncate a previous file if the out file already exists
-      --buffer-size <MOCKER-BUFFER-SIZE>
-          Set the size of the buffer (number of rows)
-      --thread-channel-capacity <MOCKER-THREAD-CHANNEL-CAPACITY>
-          Set the capacity of the thread channel (number of messages)
+  -N, --n-threads <N_THREADS>
+          Enable multithreading and set the number of threads (logical cores) to use [default: 1]
+  -C, --thread-channel-capacity <THREAD_CHANNEL_CAPACITY>
+          The maximum amount of messages that can be accumulated in the thread channels before holding
+  -R, --read-buffer-size <READ_BUFFER_SIZE>
+          The size of the read buffer used when converting (in bytes) [default: 5368709120]
+  -W, --write-buffer-size <WRITE_BUFFER_SIZE>
+          The size of the write buffer used when mocking (in rows) [default: 1000000]
   -h, --help
          Print help
+  -V, --version
+          Print version
 ```

-For example, if you wanted to mock 1 billion rows of a fixed-length file from a schema located at `./my/path/to/schema.json`, with the output name `mocked-data.flf`, and enforce that the file should not already exist, you could run the following command:
-```
-$ evolution mock --schema ./my/path/to/schema.json --out-file mocked-data.flf --n-rows 1000000000 --force-new
-```
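Conceptually, mocking is the inverse of parsing: each generated value is padded to its column's fixed width and the padded fields are concatenated into one record. A toy Rust sketch of that step (hypothetical, not the actual evolution mocker; widths and pad symbols follow the example schema's `id`, `name`, and `employed` columns):

```rust
// Toy illustration of producing one mocked fixed-length row; this is
// NOT the evolution mocker. Per the example schema: id is 9 chars
// right-aligned with '_', name is 32 chars right-aligned with spaces,
// employed is 5 chars centred with '*'.
fn mock_row(id: u32, name: &str, employed: bool) -> String {
    format!("{:_>9}{:>32}{:*^5}", id, name, employed)
}

fn main() {
    let row = mock_row(1, "Jane", true);
    assert_eq!(row.len(), 46); // 9 + 32 + 5 bytes, identical for every row
    println!("{row}");
}
```

Every row having the exact same byte length is what lets the mocker buffer and write rows in large fixed-size batches.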
-
-
-### 🧵 Threading
+To specify the log verbosity of the program, set the `RUST_LOG` environment variable to your wanted value, e.g., `INFO` or `DEBUG`.

-There exists a global setting for the program called `--n-threads` which dictates whether the invoked command will execute in single- or multithreaded mode. This argument should be a number representing the number of threads (logical cores) that you want to use. If you try to set a larger number of threads than your system has logical cores, then the program will use **all available logical cores**. If this argument is omitted, the program runs in single-threaded mode.

-**Note that running in multithreaded mode only has a clear performance benefit for substantially large workloads.**
+## Threading

-If you are unsure how many logical cores your CPU has, the easiest way to find out is to run the program with the `--n-threads` option set to a large number. The program will check how many logical cores you have and see whether this option exceeds the possible value. If the value you passed is greater than the number of logical cores on your system, then the number of logical cores available will be logged to stdout.
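The clamping behaviour described above (a request above the logical core count falls back to all available cores) can be sketched in a few lines of Rust using the standard library; this is an illustration of the idea, not the actual evolution implementation:

```rust
use std::thread;

// Hypothetical sketch of clamping a requested thread count to the
// host's logical core count; NOT the actual evolution implementation.
fn effective_threads(requested: usize) -> usize {
    // available_parallelism() reports the number of logical cores.
    let logical = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Requests above the logical core count fall back to all cores,
    // and a request of zero still gets one thread.
    requested.clamp(1, logical)
}

fn main() {
    println!("would use {} thread(s)", effective_threads(usize::MAX));
}
```

`std::thread::available_parallelism` is the portable way to query logical cores, so the same check works on both Windows and Unix hosts.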
-
-You could also potentially use one of the commands below depending on your host system.
-
-### Windows
-```
-$ Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors
-```
-
-Use the value found under **NumberOfLogicalProcessors**.
-
-### Unix
-```
-$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
-```
+To know how many threads (logical cores) you have available on your system, you can use one of the following commands depending on your host system:

-The number of logical cores is calculated as: **threads per core × cores per socket × sockets**.
+- Windows:
+  - Command: `Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors`
+  - Use the value found under **NumberOfLogicalProcessors**.
+- Unix:
+  - Command: `lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('`
+  - The number of logical cores is calculated as: **threads per core × cores per socket × sockets**.


-## ⚠️ License
-All code is to be held under a general MIT license; please see [LICENSE](https://github.com/firelink-data/evolution/blob/main/LICENSE) for specific information.
+## License
+All code is copyright of [firelink](https://github.com/firelink-data/) and published under a general MIT license; please see [LICENSE](https://github.com/firelink-data/evolution/blob/main/LICENSE) for specific information.
