🦖 *Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!*
## 🔎 Overview
Take your **old and inefficient fixed-length files** and **evolve them into a modern data format like Apache Parquet!**
This repository hosts the **evolution** program, which both lets you convert existing fixed-length files into other data formats and lets you create large amounts of mocked data blazingly fast. The program supports full parallelism and utilizes SIMD techniques, when possible, for highly efficient parsing of data.
To get started, follow the installation, schema setup, and example usage sections below in this README. Happy hacking! 👋🥳
All of the code in this repo is open source and licensed according to [LICENSE](https://github.com/firelink-data/evolution/blob/main/LICENSE); refer to [this](https://github.com/firelink-data/evolution?tab=readme-ov-file#-license) link for more information.
## 📦 Installation
The easiest way to install an **evolution** binary on your system is by using the [Cargo](https://crates.io/) package manager (which downloads it from [this](https://crates.io/crates/evolution) link).
```
$ cargo install evolution

(available features)
- rayon
- nightly
```
Alternatively, you can build from source by cloning the repo and compiling using Cargo. See below for available optional features.
- Installing with the **rayon** feature will utilize the [rayon](https://docs.rs/rayon/latest/rayon/) crate for parallel execution instead of the standard library threads. It also enables converting in **chunked** mode. Please see [this](https://github.com/firelink-data/evolution?tab=readme-ov-file#%EF%B8%8F%EF%B8%8F-converting) reference for more information.
- Installing with the **nightly** feature will use the [nightly](https://doc.rust-lang.org/book/appendix-07-nightly-rust.html) toolchain, which is by nature unstable. To run this version you need the nightly toolchain installed on your system; you can install it by running `rustup install nightly` from your shell.
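For example, to install the binary with the **rayon** feature enabled, pass Cargo's standard `--features` flag:

```
$ cargo install evolution --features rayon
```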
## 📝 Schema setup
All available commands in **evolution** require an existing valid **schema**. A schema, in this context, is a [json](https://www.json.org/json-en.html) file specifying the layout of the contents of a fixed-length file (flf). Every schema used has to adhere to [this](https://github.com/firelink-data/evolution/tree/main/resources/template-schema.json) template. If you are unsure whether or not your own schema
file is valid according to the template, you can use [this](https://www.jsonschemavalidator.net/) validator tool.
An example schema can be found [here](https://github.com/firelink-data/evolution/blob/main/resources/example_schema.json), and looks like this:
```
{
    "name": "EvolutionExampleSchema",
    "columns": [
        {
            "name": "id",
            "offset": 0,
            "length": 9,
            "dtype": "Int32",
            "alignment": "Right",
            "pad_symbol": "Underscore",
            "is_nullable": false
        },
        {
            "name": "name",
            "offset": 9,
            "length": 32,
            "dtype": "Utf8",
            "is_nullable": true
        },
        {
            "name": "city",
            "offset": 41,
            "length": 32,
            "dtype": "Utf8",
            "alignment": "Right",
            "pad_symbol": "Backslash",
            "is_nullable": false
        },
        {
            "name": "employed",
            "offset": 73,
            "length": 5,
            "dtype": "Boolean",
            "alignment": "Center",
            "pad_symbol": "Asterisk",
            "is_nullable": true
        }
    ]
}
```
- If you are unsure about valid values for the **dtype**, **alignment**, and **pad_symbol** fields, please refer to the template, which lists all valid values.
- All columns have to provide the fields **name**, **offset**, **length**, and **is_nullable**, whereas **alignment** and **pad_symbol** can be omitted (as they are in this example for the *name* column). If they are not provided, they will assume their default values, which are "**Right**" and "**Whitespace**" respectively.
- The default values come from the [padder](https://github.com/firelink-data/padder) crate which defines the enums
`Alignment` and `Symbol`, with default implementations as `Alignment::Right` and `Symbol::Whitespace` respectively.
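To make the layout concrete, here is a small sketch (not part of **evolution** itself) of how one record padded according to the example schema above can be sliced back into its fields by **offset** and **length**:

```rust
// Slice a fixed-length record into a field by (offset, length).
// Illustrative only; this is not the program's actual parser.
fn field(record: &str, offset: usize, length: usize) -> &str {
    &record[offset..offset + length]
}

fn main() {
    // Build one record laid out per the example schema:
    //   id:       9 chars, right-aligned, padded with '_'
    //   name:     32 chars, right-aligned, padded with ' ' (the defaults)
    //   city:     32 chars, right-aligned, padded with '\'
    //   employed: 5 chars, centered, padded with '*'
    let record = format!(
        "{:_>9}{:>32}{}{}",
        123,
        "Alice",
        format!("{}{}", "\\".repeat(26), "London"),
        format!("{:*^5}", true),
    );
    assert_eq!(record.len(), 78); // 9 + 32 + 32 + 5

    // Parse each field by stripping its pad symbol.
    assert_eq!(field(&record, 0, 9).trim_matches('_'), "123");
    assert_eq!(field(&record, 9, 32).trim(), "Alice");
    assert_eq!(field(&record, 41, 32).trim_matches('\\'), "London");
    assert_eq!(field(&record, 73, 5).trim_matches('*'), "true");
    println!("parsed record OK");
}
```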
## 🚀 Example usage
If you install the program as explained above, then simply running the binary will show you the following helpful usage print:
```
🦖 Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!

...

Options:
  ...
  -V, --version  Print version
```
As you can see from above, the functionality of the program comprises two main commands: **convert** and **mock**. If you installed the program with the **rayon** feature you will also have access to a third command called **c-convert**. This stands for **chunked-convert** and is an alternative implementation. Documentation for this command is work-in-progress.
- If you want to see debug prints during execution, set the `RUST_LOG` environment variable to `DEBUG` before executing the program.
Running the **convert** command with `--help` prints its available options, among them:

```
  ...
          Specify the .json schema file to use when converting
  ...
  -h, --help
          Print help
```
To convert a fixed-length file called `old-data.flf`, with associated schema located at `./my/path/to/schema.json`, to a parquet file with name `converted.parquet`, you could run the following command:
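A sketch of such an invocation is shown below; the option names here are illustrative, so check `evolution convert --help` on your system for the exact flags:

```
$ evolution convert \
    --file old-data.flf \
    --schema ./my/path/to/schema.json \
    --output-file converted.parquet
```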
The **mock** command similarly prints its options when run with `--help`, among them:

```
  ...
          Set the capacity of the thread channel (number of messages)
  ...
  -h, --help
          Print help
```
For example, if you wanted to mock 1 billion rows of a fixed-length file from a schema located at `./my/path/to/schema.json` with the output name `mocked-data.flf` and enforce that the file should not already exist, you could run the following command:
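A sketch of such an invocation is shown below; the option names are illustrative (including the flag that enforces the output file must not already exist), so check `evolution mock --help` for the exact flags:

```
$ evolution mock \
    --schema ./my/path/to/schema.json \
    --output-file mocked-data.flf \
    --n-rows 1000000000
```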
There exists a global setting for the program called `--n-threads` which dictates whether the invoked command will be executed in single- or multithreaded mode. This argument should be the number of threads (logical cores) that you want to use. If you try to set a larger number of threads than your system has logical cores, then the program will use **all available logical cores**. If this argument is omitted, the program will run in single-threaded mode.
**Note that running in multithreaded mode only yields a clear performance increase for substantially large workloads.**
If you are unsure how many logical cores your CPU has, the easiest way to find out is by simply running the program with the `--n-threads` option set to a large number. The program will check how many logical cores you have and see whether this option exceeds the possible value. If the value you passed is greater than the number of logical cores on your system, then the number of logical cores available will be logged to you on stdout.
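Alternatively, you can query the count yourself; the Rust standard library exposes it directly (a minimal sketch, independent of **evolution**'s internals):

```rust
use std::thread;

fn main() {
    // available_parallelism() reports the parallelism available to this
    // process, which on most systems is the number of logical cores.
    let n_logical_cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("logical cores: {}", n_logical_cores);
}
```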
You could also potentially use one of the commands below depending on your host system.