Cookiecutter BackMarket Template for Data Science

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. It includes a Docker setup, a Makefile, a .env file for secrets, and MkDocs documentation configured by default.
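As a quick orientation, starting a new project from a Cookiecutter template generally looks like the following. The repository URL is a placeholder, not this template's actual address:

```bash
# Install cookiecutter (one-time setup)
pip install cookiecutter

# Generate a new project from the template; cookiecutter will prompt
# for project name and other options defined by the template.
cookiecutter https://github.com/<org>/<this-template-repo>
```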

Why use this project structure?

Ultimately, data science code quality is about correctness and reproducibility!

When we think about data analysis, we often think just about the resulting reports, insights, or visualizations. While these end products are generally the main event, it's easy to focus on making the products look nice and ignore the quality of the code that generates them. Because these end products are created programmatically, code quality is still important! It's best to start with a clean, logical structure and stick to it throughout. We think it's a pretty big win all around to use a fairly standardized setup like this one. Here's why:

Other people will thank you

A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. It also means that they don't necessarily have to read 100% of the code before knowing where to look for very specific things.

Well-organized code tends to be self-documenting in that the organization itself provides context for your code without much overhead. People will thank you for this: it makes your work easier to navigate, to learn from, and to trust.

Because this default project structure is logical and reasonably standard across most projects, somebody who has never seen a particular project can quickly figure out where to find its various moving parts.

Another great example is the Filesystem Hierarchy Standard for Unix-like systems. The /etc directory has a very specific purpose, as does the /tmp folder, and everybody (more or less) agrees to honor that social contract. That means a Red Hat user and an Ubuntu user both know roughly where to look for certain types of files, even when using each other's system — or any other standards-compliant system for that matter!

Ideally, that's how it should be when a colleague opens up your data science project.
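By way of illustration, a project generated from a template like this one typically lands in a layout along these lines. This is a sketch, not this template's exact output; folder names may differ:

```
├── Makefile           <- Commands for common tasks (e.g. building data, running tests)
├── Dockerfile         <- Container definition for reproducible runs
├── .env               <- Secrets and local configuration (kept out of version control)
├── mkdocs.yml         <- MkDocs configuration for the project documentation
├── docs/              <- Documentation sources served by MkDocs
├── data/
│   ├── raw/           <- The original, immutable data dump
│   └── processed/     <- The final, canonical datasets for analysis
├── notebooks/         <- Jupyter notebooks for exploration
└── src/               <- Source code used in this project
```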

You will thank you

Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py or new_make_figures01.py to get things done. Here are some questions we've learned to ask with a sense of existential dread: Which script is the real one? Which notebook has to run first? Where did this data file actually come from?

These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work, for example, separation of concerns, abstracting the analysis as a DAG, and engineering best practices like version control.
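Since this template ships with a Makefile, make is one natural way to express that DAG: each step declares its inputs, and make reruns only what is out of date. The file and script names below are hypothetical, a minimal sketch rather than this template's actual targets:

```make
# Hypothetical pipeline: raw data -> cleaned data -> figures.
# Running `make reports/figures/results.png` rebuilds only stale steps.

data/processed/clean.csv: data/raw/survey.csv src/clean_data.py
	python src/clean_data.py data/raw/survey.csv $@

reports/figures/results.png: data/processed/clean.csv src/make_figures.py
	python src/make_figures.py data/processed/clean.csv $@
```

Written down like this, the "which script runs first?" question disappears: the dependency graph lives in the repository instead of in someone's memory.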

Go for it! This is a lightweight structure, and is intended to be a good starting point for many projects. Or, as PEP 8 put it:

Consistency within a project is more important. Consistency within one module or function is the most important. ... However, know when to be inconsistent -- sometimes style guide recommendations just aren't applicable. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!