Python Deliberate Practice
First of all, don't be afraid: read about the Plateau of Productivity. More importantly, be patient; Peter Norvig's essay, Teach Yourself Programming in Ten Years, is a good read on this.
Motivation
The language war between Python and R is one of the most frequently discussed topics among data scientists, and there doesn't seem to be a consensus on which one is better. Personally, I use both R and Python, but for very different purposes. I mainly use tidyverse packages (dplyr + ggplot2) for data analysis and data visualization, and Python for web scraping, task automation, and building basic web applications in Flask.
By now, I have a pretty good working knowledge of the R language. There are obviously many more things that I can learn - in particular building and maintaining R packages as well as more advanced R materials. Yet, the appeal of Python has always been there for me for a few reasons:
- It's a general purpose programming language, so presumably it is a lot easier to learn good software engineering principles. (What are they though?)
- Many data stacks are built using tools in the Python ecosystem (ETL using Airflow, front-end using Flask with RESTful API support, Machine Learning using scikit-learn) - being able to use the same language for different parts of the data stack will bring prototypes closer to production.
To me, the appeal of Python is not necessarily the data analysis part; R already does a great job there. Rather, the appeal of using Python for data work is that you have a higher chance to see how data plays a role within the whole integrated technology stack. Knowing Python is likely to make me a better end-to-end Data Scientist and a better Software Engineer.
Here is a great reddit answer that explains the intersection and disjoint union of the two languages beautifully.
Deliberate Practice
I am a huge believer in learning by doing, and there are a lot of opportunities on the job where I can hone my Python skills through Deliberate Practice:
-
Identify the Top Performers: I think there are quite a few people at work (e.g. Dan F.) who can really be role models for me to follow. Understand what they've been through to get to where they are today. What mental representation of Python do they have that I do not?
-
Build Practice Plans: Ideally, based on the rough understanding of that mental representation:
- Define clear goals and select learning materials
- Create deadline and milestones for the project
- Estimate time required and come up with weekly schedules
Augment these insights with your current level of mental representation of Python to improve your understanding.
-
Targeted Practice: If I force myself to switch over to Python for Data Analysis, Data visualization, Modeling, or contribute to our internal Python Data Analysis packages, I can maximize my time practicing this skill, which is high leverage.
-
Immediate Feedback: We have a culture of code reviews, both for IC work as well as internal package work. The former is harder because most DS on our team are in the R camp. There are also the weekly Python office hours, which should be very useful. Find opportunities to get feedback as often as you can.
Performance Goals
- [Immediate] Learn to write pythonic code
- [Shorter term, easiest to practice] Write re-usable, modular, tested code for my data work and knowledge posts
- [Medium term, harder to practice] Achieve efficiency and feature parity on Data Analysis using Python compared to R
- [Longer term, hardest to practice] Write tools. Be able to work on projects that span the entire data stack using Python, and apply good software engineering principles to these projects
Project Goals
-
Outcome: I want to move my data stack to Python completely. This means doing my day-to-day data analysis work in Python instead of R, and making my code as pythonic as possible. Become a contributor to Airpy / tools, and take on one bigger Python project (ML, Data Viz, etc.).
-
Curriculum: I want to do everything I can to go through all the basic materials on the Pandas/Matplotlib combo. Expose myself to functional programming, OOP, testing in Python, or even making command-line tools. Get feedback from Airpy team members.
-
Timeframe: Efficiency parity by end of October. One contribution to Airpy by Mid November. One ongoing big project touching different stacks in Python by the end of 2016.
Project Milestones
-
Learning Python & Best Practices
-
Writing Pythonic Code
- Guidelines For Writing Pythonic Code
- Function: Use *args and **kwargs to accept arbitrary arguments in function definition
- Tuples: effective unpacking, use _ for placeholder, swap values without tmp variables
- List/Dict/Set: list comprehension, dict comprehension, dict.get, set comprehension
- Strings: use .format, use .join
- Classes: use a leading underscore (_name) to mark private functions and variables; a double leading underscore (__name) triggers name mangling
- Generator: use a generator to lazily produce an infinite sequence
- Modules: writing modules for encapsulation
- Formatting: pep8 standards
- Executable script: guard the entry point with if __name__ == "__main__":
- Import: The right way to do imports
- Writing Idiomatic Python - Jeff Knupp
- Stanford CS 41: Idiomatic Python
- Another Tutorial On How To Write Pythonic Code
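To make the guidelines above concrete for myself, here is a minimal sketch that exercises several of them in one small script: *args/**kwargs, tuple unpacking with _, swapping without a temporary variable, comprehensions, dict.get, .format and .join, a lazy generator, underscore naming in a class, and the main guard. The names describe, naturals, and Account are made up for illustration and do not come from any of the linked materials.

```python
from itertools import islice


def describe(*args, **kwargs):
    """Accept arbitrary positional and keyword arguments."""
    # List comprehension over the positional arguments.
    parts = [str(a) for a in args]
    # .format for each keyword pair; sorted so the output order is stable.
    parts += ["{}={}".format(k, v) for k, v in sorted(kwargs.items())]
    # .join builds the final string in one pass.
    return ", ".join(parts)


def naturals(start=0):
    """Generator: lazily yield an infinite sequence of integers."""
    n = start
    while True:
        yield n
        n += 1


class Account:
    """Hypothetical class showing the underscore naming conventions."""

    def __init__(self, owner):
        self.owner = owner
        self._balance = 0    # single underscore: private by convention only
        self.__pin = "0000"  # double underscore: name-mangled to _Account__pin


def main():
    # Tuple unpacking, with _ as a placeholder for a value we ignore.
    name, _, language = ("Ada", "unused", "Python")

    # Swap two values without a temporary variable.
    a, b = 1, 2
    a, b = b, a

    # Dict and set comprehensions; dict.get with a default for a missing key.
    squares = {n: n * n for n in range(4)}
    evens = {n for n in range(10) if n % 2 == 0}
    missing = squares.get(99, -1)

    # islice takes a finite slice of the infinite generator.
    first_five = list(islice(naturals(), 5))

    print(describe(name, language, a=a, b=b))
    print(squares, evens, missing, first_five)


if __name__ == "__main__":
    main()
```

The main guard means the file can be imported as a module (for example from a test) without running main(), which ties in with the module and testing goals further down.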
-
iPython Notebook
-
Pandas For Data Analysis
-
Data Visualization
- BIDS: Python Bootcamp: Intro to Matplotlib: The 800 pound gorilla, everything is customizable, but very low level
- Seaborn: Good for statistical visualization. I still find it a bit limited in the types of simple plots it can do
- Bokeh: Interactive, browser-based data visualization
- A Dramatic Tour through Python's Data Visualization Landscape (including ggplot and Altair)
-
Writing Object Oriented Programming Python Code
- Computational Biology: OOP For Scientist
- Improve Your Python: Jeff Knupp: OOP
- BIDS: Python Bootcamp: OOP
- Simeon Franklin's Twitter University Class (not available to the public)
-
Writing Functional Programming Python Code
-
Machine Learning In Python
-
Testing In Python
Next Steps / Level In 2017
Once I have mastered all of the above, the next natural step is to create public work that other people can use, making useful tools available to others. A great introduction to getting started is Tim Hopper's talk, Sharing Your Side Projects.
-
Logging In Python (Next Year?)
-
Writing Command-Line Tool (Next Year?)
-
Building Packages In Python (Next Year?)
Reference
- Python Tutor Visualizer
- Python For Data Analysis
- Stanford CS 41: Python
- Berkeley CS 88: Python Data Structure
- Harvard CS 109: Data Science
- Berkeley BIDS Python bootcamp
- Josh Bloom's Python Computing For Data Science
- Writing Idiomatic Python - Jeff Knupp
- Another Tutorial On How To Write Pythonic Code
- Pandas Cookbook
- Udemy course