
DATA310

DATA 310 - Professor Frazier; MWF 1200-1250

Machine Learning Final Project

Rugby World Cup: Creating a Model that Predicts Point Differentials

For my final project, I collected data on the games played during the Rugby World Cup and tried to build a model that predicts the point differential, that is, the number of points by which a team won or lost a game. The variables in my data are win or loss, point differential, meters run, runs, clean breaks, offloads, turnovers conceded, possession percent, territory percent, scrums won, scrums won percent, lineouts won, lineouts won percent, red cards, yellow cards, and penalties conceded.

To get the best model possible, I ran three different models: linear regression, a decision tree, and a random forest. I tried each of these models on three different versions of the data. Because the data have many variables, I needed a way to reduce the dimensionality.

First, I used t-SNE. Then, I manually selected a small number of variables that I thought would be the best predictors of game outcomes: meters run, clean breaks, offloads, turnovers conceded, lineouts won, and penalties conceded. The linear model fit to these selected variables produced the lowest mean squared error (MSE) of any model I ran.

I also wanted to see whether the percentage of lineouts won would be a better predictor than the number of lineouts won, so I ran a third round of models using the same selected variables, swapping lineouts won for lineouts won percent. None of the results from this round were as good as the linear model from the previous round, so I concluded that lineouts won is the better variable to include in the model.
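The comparison described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the code used for the project: the column names in `FEATURES`, the train/test split, the model settings, and above all the data are assumptions. The data here are randomly generated placeholders rather than real Rugby World Cup statistics, so the printed MSE values say nothing about the actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical column names for the six hand-picked predictors.
FEATURES = [
    "meters_run", "clean_breaks", "offloads",
    "turnovers_conceded", "lineouts_won", "penalties_conceded",
]


def make_synthetic_games(n_games=200, seed=0):
    """Generate placeholder data; NOT real Rugby World Cup statistics."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_games, len(FEATURES)))
    # Assumed linear relationship plus noise, purely for illustration.
    weights = np.array([3.0, 2.0, 1.0, -2.0, 1.5, -2.5])
    y = X @ weights + rng.normal(scale=2.0, size=n_games)
    return X, y


def compare_models(X, y, seed=0):
    """Fit the three model types and return the test-set MSE of each."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "linear": LinearRegression(),
        "decision_tree": DecisionTreeRegressor(random_state=seed),
        "random_forest": RandomForestRegressor(
            n_estimators=100, random_state=seed
        ),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        results[name] = mean_squared_error(y_test, predictions)
    return results


X, y = make_synthetic_games()
mse_by_model = compare_models(X, y)

# Print the models from lowest to highest test MSE.
for name, mse in sorted(mse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name}: test MSE = {mse:.2f}")
```

With real data, `X` would hold the selected columns and `y` the point differential. Rerunning `compare_models` with the lineouts won column swapped for lineouts won percent would correspond to the third round of models.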