If you’ve ever taken a graduate-level statistics class, your instructor most probably began the course with the Central Limit Theorem (CLT). Before we get started, a refresher on the CLT:

Regardless of the distribution of the population (as long as its variance is finite), the sampling distribution of the sample mean is approximately normal, provided that the samples are randomly picked with replacement and the sample size is sufficiently large (n ≥ 30 is a common rule of thumb).

If the population is normally distributed, the sampling distribution of the sample mean is normal even for small sample sizes.

As we increase the sample size, the sample means cluster more tightly around the true mean (the mean of the population), because the standard error shrinks in proportion to 1/√n. …
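The refresher above can be checked with a quick simulation. What follows is a minimal sketch in Python using only the standard library. The exponential population, the seed, the sample sizes, and the helper name `sample_means` are all assumptions chosen for illustration, not part of the original article.

```python
import random
import statistics

def sample_means(population, n, draws, rng):
    """Return `draws` sample means, each from n values drawn with replacement."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed population: strongly right-skewed (exponential with mean 1).
population = [rng.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

for n in (5, 30, 200):
    means = sample_means(population, n, draws=2_000, rng=rng)
    print(
        f"n={n:>3}  mean of sample means={statistics.fmean(means):.3f}  "
        f"spread={statistics.stdev(means):.3f}  "
        f"sigma/sqrt(n)={pop_sd / n ** 0.5:.3f}"
    )
```

The mean of the sample means should sit close to the population mean at every sample size, while the spread should track σ/√n and shrink as n grows.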

Imagine we are asked to develop insights and strategies for our customer base, with the intention of increasing the profitability of the company. We are given very little time to come up with a plan. Knowing that our organization has many customers, developing a strategy for each one of them would be redundant, exhausting, and sometimes counter-productive. An effective way around this is **Clustering**.

Clustering is an unsupervised machine learning technique that groups data points based on their similarities. We will be focusing on perhaps the most used (or abused) technique, K-means clustering, where K refers to the number of segments you desire (yep, you have the power!). In my articles I refrain from going into technicalities (you will find enough content explaining these concepts on the web) and focus more on implementation. …
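To make the idea concrete, here is a minimal sketch of K-means (plain Lloyd's algorithm) in standard-library Python. It is not the implementation used in the article. The function name `kmeans`, the toy `customers` data (annual spend, number of visits), and the choice of K = 2 are assumptions made up for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical customers: (annual spend, number of visits).
customers = [(200, 2), (220, 3), (250, 2), (900, 12), (950, 15), (1000, 14)]
centroids, labels = kmeans(customers, k=2)
print(labels)
print(centroids)
```

On real data the features would usually be scaled first, since otherwise the column with the largest range (here, spend) dominates the distance.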

We’re predicting diamond prices today. Care to join?

The diamonds dataset ships with R’s ggplot2 package. We intend to predict diamond prices based on the features available. There are 53940 records in the dataset. As a ritual, let’s split the data into training (70%) and test (30%) sets.
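The 70/30 split can be sketched as follows. This is a standard-library Python illustration rather than the article's own code. The helper name `train_test_split`, the seed, and the use of row indices as stand-ins for the actual records are assumptions.

```python
import random

def train_test_split(rows, train_pct=70, seed=123):
    """Shuffle a copy of rows and split it into (train, test) by percentage."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_pct // 100  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

row_ids = range(53940)  # stand-ins for the 53940 diamond records
train, test = train_test_split(row_ids)
print(len(train), len(test))  # 37758 16182
```

Shuffling before cutting matters: if the rows happened to be ordered by some column, taking the first 70% would give the model a biased view of the data.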

Now that we have our training data, let us check the structure of the dataset. I am using R for my analysis.

A fun part of being a data analyst is the opportunity to learn across domains, and today is no different. Before we analyze the data further, let us understand what each column in the dataset means. And who do we have to help us with that? The Gemological Institute of America (GIA). …

Let’s just admit it, we love linear regression! A magical straight line cutting through a cloud of data points is all that we desire. We spot a linear model with a low root-mean-squared error (RMSE) and a high adjusted R-squared, and we claim to have estimated the true nature of the data. I wish things were this simple. Ever heard the adage, all that glitters is not gold? Well, I intend to underscore the same.
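A small sketch of why those two metrics can mislead, in standard-library Python. The data below are made up for illustration (y = x², deliberately curved) and are not the article's data. The helper names `fit_line` and `fit_metrics` are likewise assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fit_metrics(xs, ys, intercept, slope, n_predictors=1):
    """Return (rmse, adjusted_r2, residuals) for the fitted line."""
    n = len(xs)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    ss_res = sum(r ** 2 for r in residuals)
    y_bar = sum(ys) / n
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return math.sqrt(ss_res / n), adjusted_r2, residuals

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]  # assumed curved relationship, not linear
intercept, slope = fit_line(xs, ys)
rmse, adjusted_r2, residuals = fit_metrics(xs, ys, intercept, slope)
print(f"RMSE={rmse:.2f}  adjusted R-squared={adjusted_r2:.3f}")
print([round(r, 1) for r in residuals])
```

The adjusted R-squared comes out high even though the straight line is the wrong shape. The residuals give it away: they are positive at both ends and negative in the middle, a pattern that a summary metric alone hides.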

All we do in this post is predict the amount of power generated (in megawatts) based on the wind speed (in meters per second). Before we dive deep into the data, let’s put on our thinking hats for a while. How would the power generated by a windmill vary with the wind speed? Are we thinking linear, are we thinking exponential? Aren’t we alike? …