For what it's worth, this is my preferred structure for a python/R project (provided code is in python, but similar concepts exist in R)
The core principles are:
In projects, updates (not code runs) generally happen in discrete buckets:
- Data Preprocessing
- Modeling
- Output
Because of this we don't want to mix the code across buckets. This lets us revert individual files to a specific commit, rather than figuring out what has to be reverted inside each file, and it makes testing your code in discrete units simpler. NOTE: moving a spec-file from OS X to Linux is a problem here, because `conda list --explicit` pins specific package builds, and those are architecture-specific.
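Concretely, that split might look like the layout below (directory names are illustrative, not prescriptive):

```
project/
├── src/
│   ├── preprocessing/
│   ├── modeling/
│   └── output/
├── tests/
├── conf/
├── spec-file.txt
└── setup.py
```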
I prefer using a spec-file over a requirements file because in DS projects it's easier to recreate the conda env from it:

```
conda list --explicit > spec-file.txt              # export the exact package list
conda create --name new_env --file spec-file.txt   # recreate the env
```

Make sure you run the following command in your project:

```
pip install -e .   # run this under your code directory
```

I like using hydra-core to maintain run configurations. It keeps a default set of configurations that you maintain through a YAML file, and each parameter can then be overridden from the command line. This also tells engineering which bits of your configuration might be user-configurable.
It pays to break down each step of data processing into a specific class. This allows each separate step to
- be debugged independently,
- tested independently,
- be reverted individually with a git command if clients are indecisive, rather than by hand-modifying code (with tests this is less error-prone as well),
- be shipped easily: you can pickle the single pipeline object, and any transforms that require fitting come included without needing to do anything special,
- be wrapped by engineers into an application without needing to untangle your work.
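The pattern above can be sketched with plain classes in the scikit-learn fit/transform style (these are simplified stand-ins, not sklearn's actual classes; sklearn's `Pipeline` gives you the same thing for free):

```python
import pickle

class StandardScaler:
    """One processing step: learns mean/std in fit, applies them in transform."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        self.std = (sum((x - self.mean) ** 2 for x in xs) / len(xs)) ** 0.5
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

class Pipeline:
    """Chains steps; pickling it captures every fitted step in one artifact."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, xs):
        for step in self.steps:
            xs = step.fit(xs).transform(xs)
        return xs

    def transform(self, xs):
        for step in self.steps:
            xs = step.transform(xs)
        return xs

pipe = Pipeline([StandardScaler()])
out = pipe.fit_transform([1.0, 2.0, 3.0])

# the whole fitted pipeline ships as a single artifact:
blob = pickle.dumps(pipe)
```

Because each step is its own class, a reviewer can test `StandardScaler` in isolation, and reverting one step is a one-file git operation.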
The test folder should mirror your src folder exactly, except that instead of the code itself, it holds the tests for that code. One of the most useful things I find about having tests is that during a code review, if I think someone isn't covering all the edge cases, I can easily drop that edge case into the tests and see whether it passes.
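For example, a file at `src/preprocessing/clipping.py` would get a mirror at `tests/preprocessing/test_clipping.py`. A sketch of what dropping in a reviewer's edge case looks like (the module layout and `clip_outliers` function are hypothetical, for illustration):

```python
# tests/preprocessing/test_clipping.py — mirrors src/preprocessing/clipping.py

def clip_outliers(values, lower, upper):
    """Stand-in for the function under review; normally imported from src."""
    return [min(max(v, lower), upper) for v in values]

def test_happy_path():
    assert clip_outliers([1, 5, 10], lower=2, upper=8) == [2, 5, 8]

def test_empty_input():
    # the kind of edge case a reviewer can drop in during code review
    assert clip_outliers([], lower=0, upper=1) == []
```

Run with `pytest tests/`; because the folders mirror each other, finding the tests for any module is mechanical.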
Link to the presentation that covers more of this: