# Adding a fitted Weibull distribution (fitdistr) to a geom_bar (ggplot2) categorical plot

I have created a barplot of Age vs. Population size (by gender) from Census data in ggplot2. Similarly, I have used the 'fitdist' function from the fitdistrplus package to derive Weibull parameters for the normalised (by maximum observed population across all Age bins) population data.

What I would like to do is to overlay the plotted data with the distribution as a line plot. I have tried

+ geom_line (denscomp(malefit.w))

Plus other numerous (unsuccessful) strategies.

Any help that could be provided would be much appreciated! Please find the syntax appended below:

Data Structure

Order Age Male Female Total male.norm 1 1 0 - 5 2870000 2820000 5690000 1.00000000 2 2 5 - 9 2430000 2390000 4820000 0.84668990 3 3 10 - 14 2340000 2250000 4590000 0.81533101 4 4 15 - 19 2500000 2500000 5000000 0.87108014 5 5 20 - 24 2690000 2680000 5370000 0.93728223 6 6 25 - 29 2540000 2520000 5060000 0.88501742 7 7 30 - 34 2040000 1990000 4030000 0.71080139 8 8 35 - 39 1710000 1760000 3470000 0.59581882 9 9 40 - 44 1400000 1550000 2950000 0.48780488 10 10 45 - 49 1200000 1420000 2620000 0.41811847 11 11 50 - 54 1010000 1210000 2220000 0.35191638 12 12 55 - 59 812000 985000 1800000 0.28292683 13 13 60 - 64 612000 773000 1390000 0.21324042 14 14 65 - 69 402000 556000 958000 0.14006969 15 15 70 - 74 293000 455000 748000 0.10209059 16 16 75 - 79 165000 316000 481000 0.05749129 17 17 80 - 84 101000 222000 323000 0.03519164 18 18 85 plus 75500 180000 256000 0.02630662 female.norm 1 1.00000000 2 0.84751773 3 0.79787234 4 0.88652482 5 0.95035461 6 0.89361702 7 0.70567376 8 0.62411348 9 0.54964539 10 0.50354610 11 0.42907801 12 0.34929078 13 0.27411348 14 0.19716312 15 0.16134752 16 0.11205674 17 0.07872340 18 0.06382979

## Answers

This is the answer to the original question I posed above. In conjunction with the data posted in the question it is a beginning to end solution (i.e. raw data to plot).

Fitting of South-African age-population data (by gender) to a Weibull distribution (Theresa Cain and Ben Small)

load libraries

library(MASS) library(ggplot2)

Import dataset

age_gender2 <- read.csv("age_gender2.csv", sep=",", header = T)

Define total population size by gender - that is sum the entire male / female population across all age bins and place in an objects 'total.male' and 'total.female' respectively

total.male <- sum(age_gender2$Male) total.female <- sum(age_gender2$Female)

The object 'age.groups' is a single row, single column vector describing the number of age bins for the 'age_gender2' df

age.groups <- length(age_gender2$Age)

The object 'age.all' is a 1 row, 18 column empty matrix that will describe the minimum age range extracted from the age bins (categories) in the 'Age' column from age_gender2 df

age.all <- matrix(0,1,age.groups)

Next line assigns min age to each element of matrix (1 X 18) for first column in each age group. So 'for' loop assigns each column of matrix as an age (HELP: writing a for loop in R).

Structure of the 'for' loop # RULE (given in parentheses()): for each element (i) loop from 2 to the value presented in the 'age.groups' object (i.e. 18) # COMMAND (given in curly brackets {}): taking each element (i) in the 'age.male' matrix and starting at the first row (i.e. [1, by each element (i.e. [1,i], perform / assign ('<-') the following operation: ((5 X (ith element - 1)) - 2.5). This operation provides the 'middle' age for the bin

this assigns the first element (row, column) in the 'age.all' matrix the value 2.5

age.all[1,1] <- 2.5 for(i in 2:age.groups){ age.all[1,i] <- ((5*(i)) - 2.5) }

This next command 'rep' creates a (1 X 25190500) vector of all the ages within a particular bin

male.data <- rep(age.all,age_gender2$Male) female.data <- rep(age.all,age_gender2$Female)

Fit weibull distribution to age for male and female

male.weib <- fitdistr(male.data, "weibull") female.weib <- fitdistr(female.data, "weibull") male.shape <- male.weib$estimate[1] male.scale <- male.weib$estimate[2] female.shape <- female.weib$estimate[1] female.scale <- female.weib$estimate[2]

Add column "Age_Median" to 'age_gender2' df with median age. Need to transpose as 'age.all' is an 1 row X 18 column vector.

age_gender2["Age_Median"] <- t(age.all)

Fit weibull distribution

The function 'pweibull' is a PDF and finds the cumulative probability over all ages, therefore we need to subtract the previous age bin(s) from the present bin to find the probability for that bin and hence (by multiplying by the total male population) the expected population for that bin.

male.p.weibull <- matrix(0,1,age.groups) female.p.weibull <- matrix(0,1,age.groups) for (i in 1:age.groups){ male.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, male.shape, male.scale) - pweibull(age.all[1,i]-2.5, male.shape, male.scale) } for (i in 1:age.groups){ female.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, female.shape, female.scale) - pweibull(age.all[1,i]-2.5, female.shape, female.scale) }

Add column to list calculated population per age bin - 'transpose' to 1 x 18 -> 18 row x 1 column vector

age_gender2["male.prob"] <- t(male.p.weibull * total.male) age_gender2["female.prob"] <- t(female.p.weibull * total.female)

Create bar plots describing Age-Gender population distributions

Males (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)

agp.male <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Male, fill=Male)) + geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Male Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$male.prob, group=1))

Females (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)

agp.female <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Female, fill=Female)) + geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Female Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$female.prob, group=1))