Data Exploration and Analysis

The data for this analysis were sourced from GitHub and Kaggle and span several datasets: team box score data dating back to 2003, play-by-play data from the 2021-2022 season, player season statistics dating back to 1996, and players' countries of origin. We also used historical team data, including team records and player attributes. Below are snippets of the preprocessed data. During preprocessing, we reformatted the attributes, removed duplicate and redundant columns, dropped rows of unused data, and merged the datasets.

Exploration & Analysis

In this subsection, we pose a range of questions and examine them using the available data, covering both prevalent basketball topics and less-discussed subjects.

Analyzing the distributions of key quantitative attributes

Five quantitative variables were selected from the team game-by-game box score dataset for description and analysis: total points scored, field goal percentage, three-point shot percentage, free throw percentage, and offensive rebounds. The objective was to examine the distribution of each metric alongside standard summary statistics. R was used to calculate the mean, median, mode, range, variance, standard deviation, quartiles, and interquartile range for each variable, accompanied by various exploratory data analysis (EDA) techniques to address additional questions of interest.
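The original statistics were computed in R, and that code is not reproduced here. As a rough illustration only, the same summary measures can be sketched in Python with the standard library. The function name `summarize` and the sample values are assumptions made up for this example, not values from the dataset.

```python
import statistics


def summarize(values):
    """Return the summary statistics named in the text for one numeric column."""
    # Quartiles via the default "exclusive" method of statistics.quantiles.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),  # a list, since ties are possible
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),
        "quartiles": (q1, q2, q3),
        "iqr": q3 - q1,
    }


# Made-up points totals for illustration only (hypothetical, not real games).
points = [98, 102, 105, 105, 110, 112, 120]
summary = summarize(points)
```

Quartile conventions differ between tools, so the quartiles and interquartile range from this sketch may not match the values produced by the R code.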

Histograms

Five histograms illustrate the distributions of total points scored, field goal percentage, three-point shot percentage, free throw percentage, and offensive rebounds. The first three histograms are approximately normal, while free throw percentage shows a slight left skew and offensive rebounds a slight right skew. The summary statistics support these readings. For the first three variables, the mean, median, and mode closely align, as is characteristic of normal distributions. For free throw percentage, the mean is lower than the median, consistent with a left-skewed distribution; for offensive rebounds, the mean is slightly greater than the median, consistent with a slightly right-skewed distribution.
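The mean-versus-median comparison used above can be written as a small check. This is a heuristic sketch in Python, not the R code used for the analysis; the function name `skew_direction`, the `tolerance` parameter, and the sample values are hypothetical.

```python
import statistics


def skew_direction(values, tolerance=0.0):
    """Classify skew from the sign of (mean - median).

    This is only a rule of thumb: a mean below the median suggests a
    left skew, and a mean above the median suggests a right skew.
    """
    gap = statistics.mean(values) - statistics.median(values)
    if gap > tolerance:
        return "right"
    if gap < -tolerance:
        return "left"
    return "symmetric"


# Made-up values for illustration only (hypothetical, not from the dataset).
free_throw_pct = [50, 70, 75, 78, 80]    # mean 70.6 < median 75
offensive_rebounds = [6, 8, 9, 10, 17]   # mean 10 > median 9
```

Calling `skew_direction(free_throw_pct)` on the made-up sample classifies it as left-skewed, and `skew_direction(offensive_rebounds)` as right-skewed, mirroring the pattern described for the real variables.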

The above observations also hold true when verified using boxplots, density plots, and Q-Q plots.

Box Plots

The five box plots portray the distributions of total points scored, field goal percentage, three-point shot percentage, free throw percentage, and offensive rebounds, and all of them reveal outliers. These outliers align with our expectations, given the characteristics observed in the histograms. The first three box plots remain approximately normal despite the outliers, which are spread roughly evenly between the upper and lower ends. In contrast, the box plot for free throw percentage displays mostly lower outliers, consistent with a left-skewed distribution, while offensive rebounds exhibit mostly upper outliers, in line with a slightly right-skewed distribution.
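Box plots commonly flag outliers with the 1.5 × IQR rule. Assuming that convention was used here (the text does not state it), counting lower and upper outliers can be sketched in Python as follows. The function name `tukey_outliers`, the multiplier `k`, and the sample values are assumptions for illustration.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Split outliers into (lower, upper) lists using fences at k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    lower = [v for v in values if v < low_fence]
    upper = [v for v in values if v > high_fence]
    return lower, upper


# Made-up offensive rebound counts (hypothetical, not from the dataset).
rebounds = [7, 8, 9, 9, 10, 10, 11, 12, 25]
lower_outliers, upper_outliers = tukey_outliers(rebounds)
```

A sample with only upper outliers, like the made-up one here, matches the pattern described for a right-skewed variable; a predominance of lower outliers would point the other way.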

Kernel-Density Plots

Five density plots, each drawn with both a Gaussian and a rectangular kernel, illustrate the distributions of total points scored, field goal percentage, three-point shot percentage, free throw percentage, and offensive rebounds. The first three variables display approximately normal distributions, while the density plot for free throw percentage skews left and the plot for offensive rebounds exhibits a slight right skew, aligning with expectations. Notably, the Gaussian kernel produces smoother curves than the rectangular kernel for all five variables, and these curves appear to represent the underlying distributions more faithfully.
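To make the difference between the two kernels concrete, here is a minimal kernel density estimator in Python. It is a sketch under stated assumptions, not the R code behind the plots: the bandwidth is treated as a plain scale parameter, which may differ from the scaling convention used in the original analysis, and all names and sample values are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: smooth and nonzero everywhere."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def rectangular_kernel(u):
    """Uniform density on [-1, 1]: flat inside, zero outside."""
    return 0.5 if abs(u) <= 1 else 0.0


def kde(sample, x, bandwidth, kernel):
    """Density estimate at x: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(sample)
    total = sum(kernel((x - xi) / bandwidth) for xi in sample)
    return total / (n * bandwidth)


# Made-up field goal percentages (hypothetical, not from the dataset).
sample = [42.0, 44.5, 45.0, 46.0, 47.5, 49.0]
grid = [40.0 + 0.5 * i for i in range(21)]
gaussian_curve = [kde(sample, x, 1.5, gaussian_kernel) for x in grid]
rectangular_curve = [kde(sample, x, 1.5, rectangular_kernel) for x in grid]
```

Because the rectangular kernel jumps between 0 and 0.5, its estimate changes in steps as data points enter or leave the window, whereas the Gaussian estimate varies continuously. That is the smoothness difference noted above.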

Q-Q Plots

The Q-Q plots for the first three variables feature nearly straight lines, signifying approximate normality, in line with observations from the previous analyses. The Q-Q plot for free throw percentage displays a concave curve, consistent with its left skew. Similarly, the Q-Q plot for offensive rebounds exhibits a convex curve, consistent with its right skew, as previously discussed.
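The points on a normal Q-Q plot pair each sorted observation with a quantile of the standard normal distribution. A minimal Python sketch of that pairing is given below. The plotting positions `(i - 0.5) / n` are one common convention and an assumption here, as are the function name and sample values.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Return (theoretical quantile, sample value) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n; other conventions exist.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Made-up sample (hypothetical, not from the dataset).
qq_points = normal_qq_points([3, 1, 2])
```

If the sample is approximately normal, these points fall close to a straight line. With theoretical quantiles on the horizontal axis, a left-skewed sample bends the points into a concave shape and a right-skewed sample into a convex one.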

Analyzing Free Throws

Free throws hold considerable significance in the NBA and are often regarded as easy points. In tight contests, the outcome can hinge on free-throw success. Drawing fouls is also advantageous, as it indicates effective offense and creates scoring opportunities. Thus, the question arises: do teams secure more victories when they attempt and make more free throws? The chart below juxtaposes free throws attempted and free throw percentage for each game, with dashed lines representing the team averages for these metrics. Each data point is color-coded, with red indicating a loss and turquoise signifying a win. Analyzing the four California teams reveals the need for additional context. For instance, the Sacramento Kings tended to win most when they attempted more free throws than their average, whereas the Golden State Warriors achieved more victories when they made a higher-than-average number of free throws. In contrast, both Los Angeles teams displayed minimal correlation in this regard. Consequently, further context, such as a team's proficiency in three-point shooting or the presence of a top free-throw shooter, becomes crucial in understanding the dynamics at play.
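One way to put numbers on the pattern visible in the chart is to compare a team's win rate in games above and below its own average for free throws attempted. The following Python sketch is illustrative only and is not the analysis behind the chart; the function names and the game records are made up.

```python
def win_rate(outcomes):
    """Share of wins in a list of booleans, or None if the list is empty."""
    return sum(outcomes) / len(outcomes) if outcomes else None


def split_by_average_fta(games):
    """games: list of (free_throws_attempted, won) tuples for one team."""
    average_fta = sum(fta for fta, _ in games) / len(games)
    above = [won for fta, won in games if fta > average_fta]
    at_or_below = [won for fta, won in games if fta <= average_fta]
    return {
        "average_fta": average_fta,
        "win_rate_above": win_rate(above),
        "win_rate_at_or_below": win_rate(at_or_below),
    }


# Made-up game log for one team (hypothetical, not from the dataset).
games = [(30, True), (28, False), (18, False), (20, True), (24, True), (12, False)]
result = split_by_average_fta(games)
```

A large gap between the two win rates would correspond to the pattern described for the Kings, while near-equal rates would correspond to the minimal relationship seen for the Los Angeles teams. The same split could be applied to free throw percentage.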

The accompanying plot shows the proportion of points scored from free throws over five years in the NBA. The proportion, close to one-fifth of all points scored in the league over a calendar year, has remained stable across these years and is a sizable share of total scoring.
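Since each made free throw is worth one point, the yearly proportion is simply free throws made divided by total points. A minimal Python sketch is shown below; the function name, the year labels, and the totals are placeholders invented for illustration, not league figures.

```python
def free_throw_share(totals_by_year):
    """Map each year to (free throws made) / (total points scored).

    totals_by_year: {year_label: (free_throws_made, total_points)}
    """
    return {
        year: ft_made / total_points
        for year, (ft_made, total_points) in totals_by_year.items()
    }


# Placeholder totals (hypothetical, not real league data).
totals = {"year_1": (400, 2000), "year_2": (450, 2500)}
shares = free_throw_share(totals)
```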

Correlation Analysis

A basic correlation analysis was performed to assess how points scored from free throws relate to the total points scored in a game.

From the plot above, we can see a moderate correlation between free throws scored and total points scored. However, we would expect three-point shots to correlate more strongly with total points, especially in today’s NBA.
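The text does not say which correlation coefficient was used. Assuming the Pearson coefficient, the calculation can be sketched in Python as follows; the function name and sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up per-game values (hypothetical, not from the dataset).
free_throws_made = [12, 18, 15, 22, 9, 17]
total_points = [101, 110, 99, 118, 104, 108]
r = pearson_r(free_throws_made, total_points)
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. Repeating the calculation with three-point makes in place of free throws would allow the comparison suggested above.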

Analyzing the Field Goal Percentage over the years

Field goal percentage has been improving since 2011. Despite modest year-to-year variation, the overall percentage is higher than it was in the past. One possible explanation is that the league saw the emergence of many outstanding players from universities and other nations, which heightened the level of competition and pushed players to practice and play hard to keep their contracts.

Does executing a slam dunk give the team a momentum boost? And, if so, how long does the momentum boost last?

The following is an example of a game where one or more dunks could have added the momentum needed to turn the tide of the game. A positive score margin indicates that the home team has the lead. Each point represents the margin measured whenever a line was added to the box score. The dark blue curve is a smoothed function fit to the data. The light blue vertical lines indicate times when the visiting team successfully completed a slam dunk, while the red lines indicate when the home team did the same. We are interested in exploring the potential momentum boost generated by such a dynamic play. The visiting team's two dunks between the 30th and 40th minutes of the game are a possible example of this effect in action. Following the latter dunk in particular, the rate of change of the score margin shifts significantly in favor of the visiting team.
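The shift in the rate of change of the score margin can be estimated by fitting a straight line to the margin shortly before and shortly after a dunk and comparing the slopes. This is a sketch under stated assumptions, not the smoothing method used for the plot: the window length, the function names, and the sample values are all hypothetical.

```python
def margin_slope(times, margins):
    """Least-squares slope of score margin against time (points per minute)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_m = sum(margins) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in zip(times, margins))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def slope_change_around(entries, dunk_time, window=3.0):
    """Slope after the dunk minus slope before it.

    entries: list of (minute, home_margin) pairs.
    Each window needs at least two entries at distinct times.
    """
    before = [(t, m) for t, m in entries if dunk_time - window <= t < dunk_time]
    after = [(t, m) for t, m in entries if dunk_time <= t <= dunk_time + window]
    return margin_slope(*zip(*after)) - margin_slope(*zip(*before))


# Made-up margins around a visiting-team dunk at minute 33 (hypothetical).
entries = [(30, 4), (31, 5), (32, 6), (33, 6), (34, 4), (35, 2), (36, 0)]
change = slope_change_around(entries, dunk_time=33)
```

Because a positive margin means the home team leads, a negative change indicates a shift toward the visiting team. A single game cannot separate a momentum effect from chance, so this would need to be repeated across many dunks and compared against non-dunk baskets.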

Is there a correlation between a team’s average age and winning percentage?

Density plots and scatterplots were employed to investigate this question using data spanning the years 1980 to 2017. The density plots show the distribution of average age among NBA teams, with the majority falling within the 26 to 28 age range. Broken down by decade, a general trend emerges: teams tended to be younger on average during the 1980s and 2010s than during the 1990s and 2000s.

The scatterplots, on the other hand, provide a more direct depiction of the correlation between average age and winning percentage. In summary, both across the entire dataset and within each decade, we observe a moderate correlation. A closer look at the individual decades reveals that the 1990s and 2000s exhibit a slightly stronger correlation than the 1980s and 2010s.

Data

R Code

Hypothesis Testing