What every aspiring data analyst should learn first

Statistics.

Get familiar and comfortable with combinations, permutations, the different types of probability, linear regression, hypothesis testing, and the different types of distributions.
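To make a few of those topics concrete, here is a minimal sketch using only Python's standard library. The numbers (5 items, 4 coin flips) and the helper name `binomial_pmf` are made up for illustration; they are not from any particular course or textbook.

```python
import math

# Permutations: ordered ways to pick 2 items out of 5 (order matters).
n_permutations = math.perm(5, 2)   # 5 * 4 = 20

# Combinations: unordered ways to pick 2 items out of 5.
n_combinations = math.comb(5, 2)   # 20 / 2! = 10


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p (the binomial distribution)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Chance of exactly 2 heads in 4 fair coin flips: 6 / 16 = 0.375
p_two_heads = binomial_pmf(2, 4, 0.5)

print(n_permutations, n_combinations, p_two_heads)
```

Note that `math.perm` and `math.comb` need Python 3.8 or newer.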

 

Python or R?

When it comes to analysis, Python is mostly used for Machine Learning and Deep Learning. Python used to have an advantage over R for Deep Learning because TensorFlow, Keras, and Hadoop (or was it Spark?) supported only Python. As of this writing, all of those tools support R as well. That could become a problem for Python: the academic world prefers R and has only used Python because Deep Learning/AI forced it to.

R is the king of statistical analysis. Python has statistical packages that turn it into a really good tool for statistical analysis, but R is far superior in this field and requires less code to do the same things.

Also, R has better data visualization tools: ggplot2, ggvis, Shiny, rCharts, etc. The king of data visualization, though, is D3.js. Some data pros working in R, Python, Scala, and Julia (the current data analysis languages) who focus heavily on data visualization and storytelling use D3.js, but it requires knowing HTML, CSS, and JavaScript. That's a bit too much for most people.

Either way you go, you'll be fine. The biggest pros and cons of each are:

Pros

Python: As of now, it has books on how to take advantage of TensorFlow and Keras for Deep Learning/AI; R is going to take at least a year to get good books on these tools. Python has some sort of database advantage over R, but I’m not sure what that is. Has Plotly support.

R: Has an unbelievable number of books on statistical analysis and various analysis techniques. R has an incredibly active community (#rstats) that is 100% focused on analysis. R looks to be the preferred language in the job market for Data Analysts. R Markdown and R Notebooks are fantastic. RPubs is fantastic. Has Plotly support.

Cons

Python’s community is very fractured across topics, especially compared to R’s, so it’s easier to find the help you need when using R. The community and some pivotal Python books are still split between Python 2 and Python 3. Python doesn’t have built-in vectorization like R; you need a library such as NumPy or pandas for that. Plain for loops are harder to write and slower to run than R’s vectorized operations.
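To illustrate the loop-versus-vectorization point, here is a rough sketch. It assumes NumPy is installed (a third-party library, not part of base Python), and the prices and tax rate are made-up example data.

```python
import numpy as np  # third-party; assumed installed (pip install numpy)

prices = [10.0, 20.0, 30.0]   # made-up example data
tax_rate = 0.5                # made-up rate, chosen to keep the math exact

# Base Python: element-wise arithmetic needs an explicit loop
# (or a list comprehension).
with_tax_loop = []
for price in prices:
    with_tax_loop.append(price * (1 + tax_rate))

# NumPy: the same operation applied to the whole array at once,
# much like arithmetic on an R vector.
with_tax_vectorized = np.array(prices) * (1 + tax_rate)

print(with_tax_loop)
print(with_tax_vectorized.tolist())
```

Both approaches produce the same numbers. The difference is that the vectorized version is shorter to write, and on large arrays it is typically much faster because the looping happens in compiled code.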

Older R versions were slow, and parallel programming wasn’t a thing in R. That’s no longer a problem, though I think Python might still be a hair faster for most big data processing. There’s also Rcpp and Cython if you ever want lightning-fast code. R gets bashed a lot by Python-only users. It’s annoying, but best ignored.
