Introducing R For Sport Science Data Analysis

To date, all our work has been carried within Excel whether through PowerQuery or PowerPivot. The rationale for carrying out much of this within Excel was that I felt much of the audience would be comfortable and possibly experienced within Excel. However Excel does have limitations to it’s ability; formulas can be easily corrupted; can be slow dealing with large data sets (100,000+ rows); models need daily maintenance;  replicating graphs/charts multiple times can be very time consuming plus many other issues I’m sure people come across in their environments.

Screen Shot 2018-10-05 at 15.15.27

This is where we can look to other softwares/platforms specifically designed to deal with and analyse large data sets or groups of files together. There are a number of different platforms available for us to look into: R/RStudio; Python; Microsoft SQL Server; PowerBI; MATLab; SAS etc (I refuse to add SPSS to the list due to painful university stats lessons ;)). In order to fully utilise any of the above, we have to look at learning a type of programming language (R, Python, SQL, DAX, MATLAB). I believe then its a question of what are looking to do in our analysis and which platform will allow us to fully satisfy our needs. These are some of the areas that the above platforms have been used for:

  • Statistical analysis
    • Streamline use of statistical analysis in day-to-day work (Assess real change, Noise Vs Signal, longitudinal monitoring etc)
      • Create reports which automated individualised analyses of player data (See here for an example from Mathieu Lacome)
  • Data Manipulation
    • Collate multiple data sources into one data frame
    • Manipulate data exports in usable formats
  • Working with a large number of files (i.e. dealing with individual data exports)
    • Create loops or functions which automate dealing with multiple files (Useful in settings with multiple teams or sports and large numbers of athletes)
      • Jace Delaney (and here) has carried out research on GPS data using these processes
  • Utilise API facility from other softwares
    • Automate data extraction and analysis to reduce time required
  • Deal with multiple file types (xlsx, txt, csv, json, xml etc.)
  • Create visuals and reports
  • Create web based applications

From my limited experience, 2 of the above platforms will satisfy all of the criteria listed which are R or Python. Lets now look at both of those in more detail.

Running the base version of either R or Python isn’t the easiest for beginners or the most intuitive version available to use. Fortunately, there are different platforms available to make learning and using both systems easier for us. Both R and Python need to have their “base” system installed before we can use the user friendly systems for them. Anaconda is a platform through which we can download Jupyter Notebooks (for both R and Python) and RStudio (for R only). While my experience of RStudio is more extensive than in Jupyter (Fun fact…Jupyter comes from combining 3 computer languages Julia, Python & R together), I do find it to be a much better system to work in (RStudio is an Interactive Development Environment (IDE) specifically designed for R so this may be expected). While there are a number of IDEs available for Python, my understanding is none match up to either Jupyter Notebooks or RStudios ability

DojON_gXkAERHnI.jpg-large
RStudio

Installing Jupyter and RStudio through Anaconda means that we will have a number of the main packages used in both R and Python already installed (Confused yet?). WTF is  a package I hear you ask? As both are considered open source platforms, developers around the world have kindly put together packages which speed up some regularly used processes and can reduce the need for long sections of code.

If we didn’t have Anaconda we would have to manually install a lot of these packages ourselves. Different packages are designed for different tasks, ggplot2 is about graphs and charts, googlesheets streamlines pulling information from Google Sheets, lubriDate helps working with times and data in R etc. There are hundreds of packages available some dealing with widespread areas some with very minor areas. As packages are different between R and Python going forward we will mainly be working in R, more specifically RStudio.

I will finish with a word of caution for those experienced in other programming languages and for those new to the world. For the experienced, R is a vectorised programming language which allows it to easily perform some operations that may require complicated loops in other languages but can make moving from other languages to R difficult at times. For anyone new to R, it has a very very steep learning curve but once you get over that it allows you to perform some amazing work, while you are trying to get past the learning curve take the time to explore and get things wrong in R.

R related posts:

Where to go Help Learn R

Raw GPS Data Analysis: Part 1, Part 2, Part 3

Visualising Data in R: Part 1, Part 2

Building Custom Wellness Monitoring System: Part 1, Part 2, Part 3

Using R in PowerBI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Blog at WordPress.com.

Up ↑

%d bloggers like this: