Rounding a Corner in my Master’s Degree Journey
I recently submitted the final exam for the High Performance Computing course in my Master’s Degree program in Health Data Science. While this represents completion of 70% of my overall coursework (assuming I pass, fingers crossed…), this was the final code-specific data science course. I now have three remaining classes: two operational research / healthcare-focused courses and a capstone course (read: quasi-internship, where I’ll most likely end up shifting my current day job just a bit to meet requirements, plus formal reporting and such). Due to scheduling availability and dependency requirements, it will be next summer before I complete the degree (booooo…), but this is a significant milestone nonetheless.
What I learned in this last course:
- High Performance Computing means something very different in the cloud world than it does for most companies / individuals. What I consider to be table-stakes infrastructure design is actually pretty cutting edge in most circles. It’s a good reminder to get from time to time.
- Python and R really do have some great capabilities individually. R tends to be more polished and intuitive, but Python trades some of this for the ability to scale massively with less effort. If you’re just getting started, R is probably faster to learn and will get you into more complex analysis sooner, but you’re eventually going to want to handle problems that R isn’t as well suited for, so you might be better off biting the bullet and starting (and sticking) with Python. But that’s just my current opinion, and subject to change as the technologies and my awareness of them continue to evolve.
- Deep Learning is hella slow, comparatively, but on super large/complex data (think images, unstructured text data, etc.) nothing touches it in terms of accuracy. XGBoost is generally hella fast, easy and accurate on large tabular data, but (despite much hoopla to the contrary) not always the best fit. Good ol’ Logistic/Linear Regression can, depending on the nature of your data, do just as well as or better than the more advanced approaches at a fraction of the time and compute cost. (And based on prior learnings outside the course, one shouldn’t discard Nearest Neighbors or Naive Bayes either. The world is a big place, with a lot of data, and you don’t know what you’ve got till you get it and test it out.)
- Dask is the shiznet.
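For anyone who wants to try the "get it and test it out" idea from the model comparison above, here is a rough sketch of what I mean, not the course's method. Everything in it is made up for illustration: the synthetic two-cluster data, the toy `MajorityClass` and `NearestNeighbor` models, and the `make_data` and `compare` helpers. It uses only the Python standard library so it runs anywhere, and it just times each candidate and scores it on held-out data.

```python
import random
import time
from collections import Counter


class MajorityClass:
    """Baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class NearestNeighbor:
    """Toy 1-nearest-neighbor using squared Euclidean distance."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        preds = []
        for row in X:
            # Index of the training point closest to this row.
            best = min(
                range(len(self.X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(row, self.X[i])),
            )
            preds.append(self.y[best])
        return preds


def make_data(n, seed=0):
    """Hypothetical data: two 2-D Gaussian clusters labeled 0 and 1."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 3.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def compare(models, X_train, y_train, X_test, y_test):
    """Fit each model, then record held-out accuracy and wall-clock seconds."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        preds = model.fit(X_train, y_train).predict(X_test)
        elapsed = time.perf_counter() - start
        accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
        results[name] = {"accuracy": accuracy, "seconds": elapsed}
    return results


X, y = make_data(400)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]
results = compare(
    {"majority": MajorityClass(), "1-NN": NearestNeighbor()},
    X_train, y_train, X_test, y_test,
)
for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f}, seconds={r['seconds']:.4f}")
```

The toy models only exist to keep the sketch dependency-free. The harness itself just needs objects with `fit` and `predict`, so in practice you would pass in real estimators that follow that convention and compare them on your own data.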
I’m really looking forward to what the next phase of this journey holds!