R Tutorial : Poisson regression

Published On Apr 18, 2020

Want to learn more? Take the full course at https://learn.datacamp.com/courses/ge... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

---

In the previous section, you learned how GLMs extend linear models. Now, you will learn about a GLM used to model count data: Poisson regressions.

Both in data science and everyday life, we often have counts. For example, our favorite team hopefully scores goals during a game. Likewise, our favorite player scores goals over the season. We might count trees or cancer cells. On our web page, we might look at visitors per hour or clicks per minute. If these counts are large enough, we can treat them as approximately normal. Often, however, they are not large enough for that. Hence, we need a different tool for count data: the Poisson distribution.

This plot shows two Poisson distributions in blue and the corresponding normal distributions in red. Notice the differences between them. The Poisson takes only non-negative integer values and is not symmetric when its mean is near zero.
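The comparison in the plot can be sketched numerically. This is a minimal sketch, not the code behind the original figure: the means of 1 and 30 and the helper name `compare` are assumptions chosen for illustration, since the transcript does not say which values were plotted.

```r
# Compare Poisson probabilities with a normal density that has the
# same mean and variance. The lambdas (1 and 30) are illustrative.
x <- 0:60

compare <- function(lambda) {
  data.frame(
    x       = x,
    poisson = dpois(x, lambda = lambda),
    normal  = dnorm(x, mean = lambda, sd = sqrt(lambda))
  )
}

small <- compare(1)   # mean near zero: skewed, poorly matched by a normal
large <- compare(30)  # larger mean: much closer to a normal

# Largest gap between the two curves at the integer points
gap_small <- max(abs(small$poisson - small$normal))
gap_large <- max(abs(large$poisson - large$normal))
c(lambda_1 = gap_small, lambda_30 = gap_large)

# Share of the matching normal that falls below zero,
# where a count can never be
pnorm(0, mean = 1, sd = 1)
```

The gap should be noticeably larger for the small mean, which is the asymmetry near zero that the plot illustrates.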

In contrast to the normal distribution, which models continuous numbers from negative to positive infinity, the Poisson models integers greater than or equal to zero. Additionally, the Poisson's mean and variance are the same parameter, lambda. This contrasts with the normal, which has two parameters: a mean and a variance. For a fixed time interval and area, the probability of x observations is a function of lambda and x. For example, we could use the Poisson to model the number of goals scored in one game. But we could not use it to model a mix of goals scored in one game and goals scored in one season, because the time interval would not be fixed.
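To make the role of lambda concrete, here is a small sketch using base R's `dpois()` and `rpois()`. The rate of 2 goals per game is a made-up value for illustration.

```r
# Hypothetical rate: 2 goals per game
lambda <- 2
x <- 0:10

# Built-in Poisson probability mass function
p_builtin <- dpois(x, lambda = lambda)

# The same probabilities from the formula lambda^x * exp(-lambda) / x!
p_manual <- lambda^x * exp(-lambda) / factorial(x)
all.equal(p_builtin, p_manual)

# Simulated counts: the sample mean and the sample variance
# should both land close to lambda
set.seed(1)
goals <- rpois(10000, lambda = lambda)
mean(goals)
var(goals)
```

The probability of each count depends only on lambda and x, and the simulated mean and variance are both estimates of the same parameter.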

To run a GLM with the Poisson family, our y-data must be counts: for example, 1, 2, or 3, not 1.5 or 4.1. We must also know the time and area from which our counts were taken. Last, the coefficients from a Poisson GLM are on the log scale. We will see why in chapter 3.
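A minimal sketch of fitting such a model with `glm()` might look like the following. The data frame `scores`, its columns, and the two players are simulated for illustration and are not the course's data.

```r
# Simulated data (an assumption): goals per game for two players
set.seed(42)
scores <- data.frame(
  player = rep(c("A", "B"), each = 200),
  goals  = c(rpois(200, lambda = 1), rpois(200, lambda = 2))
)

# Fit a Poisson GLM; the default link for the poisson family is the log
fit <- glm(goals ~ player, data = scores, family = "poisson")
fit$family$link

# Coefficients are reported on the log scale
coef(fit)

# Exponentiate to move back to the count scale; the intercept then
# matches player A's average goals per game
exp(coef(fit))
mean(scores$goals[scores$player == "A"])
```

Each row here is one game, so the time interval is the same for every count.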

As alluded to in the previous slide, there are times when you should not use a Poisson regression. For example, if your data are not counts or include negative values, you cannot fit the model. Likewise, if your sample area or time is not constant, do not use a Poisson distribution, because your model and inferences will be meaningless. Also, if your mean is greater than 30, you can probably use a normal distribution.
If your data have a variance greater than the mean, they are over-dispersed and you need a tool other than the Poisson. On a similar theme, if you have too many zeros, your data are zero-inflated and you need another tool as well.
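These two warning signs can be checked roughly before fitting. The sketch below is an informal diagnostic, not a formal test; the simulated vectors and the negative binomial used to produce over-dispersed counts are assumptions for illustration.

```r
set.seed(7)
# Simulated counts: one Poisson sample, one over-dispersed sample
pois_counts <- rpois(500, lambda = 3)
od_counts   <- rnbinom(500, size = 1, mu = 3)

# For a Poisson, variance / mean should be close to 1;
# a ratio well above 1 suggests over-dispersion
ratio_pois <- var(pois_counts) / mean(pois_counts)
ratio_od   <- var(od_counts) / mean(od_counts)
c(poisson = ratio_pois, overdispersed = ratio_od)

# Rough zero-inflation check: observed share of zeros versus the
# share a Poisson with the same mean would predict
observed_zeros <- mean(od_counts == 0)
expected_zeros <- dpois(0, lambda = mean(od_counts))
c(observed = observed_zeros, expected = expected_zeros)
```

With your own data, replace the simulated vectors with your count column and compare the same two quantities.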

When using formulas for either GLMs or LMs, you have two options for intercepts: estimating a comparison to a reference level, or estimating one intercept per group. The comparison can be useful when we want to examine the difference between two groups. This is the default in R. The intercept per group can be useful when we care about the average per group. This requires adding a "minus 1" to the formula.

For example, if we have two players and want to compare goals per game, we could use either approach.

You would use the default formula to estimate the difference between the two players.
You would use the intercepts option to estimate the average goals per player.
You'll see this in action during an exercise.
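The two formula options could be sketched as follows. The simulated `scores` data and the player labels are assumptions for illustration, not the exercise data.

```r
# Simulated data (an assumption): goals per game for two players
set.seed(42)
scores <- data.frame(
  player = rep(c("A", "B"), each = 200),
  goals  = c(rpois(200, lambda = 1), rpois(200, lambda = 2))
)

# Default formula: an intercept for the reference level (player A) plus
# a coefficient for the difference between player B and player A
fit_diff <- glm(goals ~ player, data = scores, family = "poisson")
coef(fit_diff)

# "- 1" drops the shared intercept, giving one coefficient per player
fit_means <- glm(goals ~ player - 1, data = scores, family = "poisson")
coef(fit_means)

# Both sets of coefficients are on the log scale; exponentiating the
# per-player ones gives the average goals per game for each player
exp(coef(fit_means))
tapply(scores$goals, scores$player, mean)
```

Both fits describe the same model. The difference estimated by the default formula equals the gap between the two per-player coefficients.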

Now, let's use Poisson regressions!

#DataCamp #RTutorial #GeneralizedLinearModelsinR
