R Coding Exercise

Placeholder file for the future R coding exercise.

Loading and Checking Data

I will now install the dslabs package. Then, I will load the package.

options(repos = c(CRAN = "https://cran.rstudio.com/")) # including this because I could not render the website without it
install.packages("dslabs")

The downloaded binary packages are in
    /var/folders/n7/2hn_592d3db_b95vx1v5wcjr0000gn/T//RtmpYdMzpU/downloaded_packages
library(dslabs)

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

I will now use the “help()” function on the “gapminder” data.

help(gapminder)

From this, I learned that the gapminder data contains outcomes related to health and income for 184 countries between 1960 and 2016. oecd = OECD countries from 2016 opec = OPEC countries form 2016

Now, I will use str() on the gapminder data.

str(gapminder)
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

Above, I can see that the gapminder data frame contains 9 variables and 10,545 observations. The variables are: country, year, infant_mortality, life_expectancy, fertility, population, gdp, continent, and region. Some of the variables are numerics or intergers, while others are factors (country, continent, and region).

I will use summary() on the gapminder data.

summary(gapminder)
                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586  

The summary above displays the countries, continents, and regions. It also shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum values for the numeric/integer variables.

Now, I will determine the type of object/class that gapminder is.

class(gapminder)
[1] "data.frame"

This output states that gapminder is a data frame.

Processing Data

Now, I will filter the gapminder data to obtain only the countries in Africa. I will assign this to a variable called “africadata”.

africadata <- filter(gapminder, continent == "Africa")

I will run str() on “africadata”.

str(africadata)
'data.frame':   2907 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
 $ population      : num  11124892 5270844 2431620 524029 4829291 ...
 $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

As you can see, there are still 9 variables (each of the same data class as before). However, there are fewer observations (2907) than before (10,545) - because we are only included the African countries here.

I will run summary() on “africadata”.

summary(africadata)
         country          year      infant_mortality life_expectancy
 Algeria     :  57   Min.   :1960   Min.   : 11.40   Min.   :13.20  
 Angola      :  57   1st Qu.:1974   1st Qu.: 62.20   1st Qu.:48.23  
 Benin       :  57   Median :1988   Median : 93.40   Median :53.98  
 Botswana    :  57   Mean   :1988   Mean   : 95.12   Mean   :54.38  
 Burkina Faso:  57   3rd Qu.:2002   3rd Qu.:124.70   3rd Qu.:60.10  
 Burundi     :  57   Max.   :2016   Max.   :237.40   Max.   :77.60  
 (Other)     :2565                  NA's   :226                     
   fertility       population             gdp               continent   
 Min.   :1.500   Min.   :    41538   Min.   :4.659e+07   Africa  :2907  
 1st Qu.:5.160   1st Qu.:  1605232   1st Qu.:8.373e+08   Americas:   0  
 Median :6.160   Median :  5570982   Median :2.448e+09   Asia    :   0  
 Mean   :5.851   Mean   : 12235961   Mean   :9.346e+09   Europe  :   0  
 3rd Qu.:6.860   3rd Qu.: 13888152   3rd Qu.:6.552e+09   Oceania :   0  
 Max.   :8.450   Max.   :182201962   Max.   :1.935e+11                  
 NA's   :51      NA's   :51          NA's   :637                        
                       region   
 Eastern Africa           :912  
 Western Africa           :912  
 Middle Africa            :456  
 Northern Africa          :342  
 Southern Africa          :285  
 Australia and New Zealand:  0  
 (Other)                  :  0  

This summary is similar to the one created for the entire gapminder data frame, however the values above are now different for the numeric/integer variables since we are only looking at the African countries.

Now, I will create a new object that contains only the infant_mortality AND life_expectancy from the “africadata” data frame.

infantmort_lifeexp <- select(africadata, infant_mortality, life_expectancy)

I will use str() and summary() to take a look at the infantmort_lifeexp data.

str(infantmort_lifeexp)
'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
summary(infantmort_lifeexp)
 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226                     

These outputs show that the only two variables are infant_mortality and life_expectancy. Then the minimum, 1st quartile, median, mean, 3rd quartile, and maximum are displayed for infant_mortality and life_expectancy.

Then, I will create a new object containing only population AND life-expectancy from the “africadata” data frame.

pop_lifeexp <- select(africadata, population, life_expectancy)

I will use str() and summary() to take a look at the pop_lifeexp data.

str(pop_lifeexp)
'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...
summary(pop_lifeexp)
   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51                         

These outputs show that the only two variables are population and life_expectancy. Then the minimum, 1st quartile, median, mean, 3rd quartile, and maximum are displayed for population and life_expectancy.

Plotting

I will now create a plot of life_expectancy as a function of infant_mortality.

ggplot(infantmort_lifeexp, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point(color = "#5fafdc") + # I will specify the points to be blue 
  labs(title = "Life Expectancy as a Function of Infant \n Mortality in African Countries",
       x = "Infant Mortality (per 1,000)",
       y = "Life Expectancy (years)") + # I will specify the title and the names of the x and y axes 
  theme(legend.position = "none", plot.title = element_text(size = 18, face = "bold", hjust = 0.5), axis.title.x = element_text(size = 12, face = "bold"), axis.title.y = element_text(size = 12, face = "bold")) # I will remove a legend (since it is unneeded here) and make the title and axis labels larger and bold 
Warning: Removed 226 rows containing missing values or values outside the scale range
(`geom_point()`).

As you can see from the plot above, there appears to be a negative relationship between infant mortality and life expectancy. This makes sense, because if there is higher infant mortality in a country, their overall life expectancy should be lower.

I will now create a plot of life_expectancy as a function of population (size). The population (x-axis) will be set to a log scale.

ggplot(pop_lifeexp, aes(x = population, y = life_expectancy)) +
  geom_point(color = "#70dc5f") + # I will specify the points to be green  scale_x_log10() + 
  scale_x_log10() + # I will set the x-axis to a log scale
  labs(title = "Life Expectancy as a Function of Population Size \n (Log Scale) in African Countries",
       x = "Population (Log Scale)",
       y = "Life Expectancy (years)") + # I will specify the title and the names of the x and y axes
  theme(legend.position = "none", plot.title = element_text(size = 18, face = "bold", hjust = 0.5), axis.title.x = element_text(size = 12, face = "bold"), axis.title.y = element_text(size = 12, face = "bold")) # I will remove a legend (since it is unneeded here) and make the title and axis labels larger and bold
Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_point()`).

The plot above appears to have individual streaks which have a positive relationship between population size and life expectancy. However, as a whole the data appears to be relatively scattered.

There appear to be “streaks” of data in both plots above. This is because we are measuring individual countries over several different years.

More Data Processing

I will find out which years contain NAs.

years_with_na_infant_mortality <- unique(africadata$year[is.na(africadata$infant_mortality)])
# use unique() and is.na() to find out which years contain NAs in infant_mortality

print(years_with_na_infant_mortality)
 [1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
[16] 1975 1976 1977 1978 1979 1980 1981 2016
# Use print() to view the output

As you can see, the years 1960-1981 and 2016 contain NA values within the infant_mortality variable.

We will use the year 2000’s infant mortality data since it contains no missing values. I will create a new object that contains only data for the year 2000 from the “africadata” data frame.

africadata_2000 <- filter(africadata, year == 2000)

I will check that this worked with str() and summary().

str(africadata_2000)
'data.frame':   51 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
 $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
 $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
 $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
 $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(africadata_2000)
         country        year      infant_mortality life_expectancy
 Algeria     : 1   Min.   :2000   Min.   : 12.30   Min.   :37.60  
 Angola      : 1   1st Qu.:2000   1st Qu.: 60.80   1st Qu.:51.75  
 Benin       : 1   Median :2000   Median : 80.30   Median :54.30  
 Botswana    : 1   Mean   :2000   Mean   : 78.93   Mean   :56.36  
 Burkina Faso: 1   3rd Qu.:2000   3rd Qu.:103.30   3rd Qu.:60.00  
 Burundi     : 1   Max.   :2000   Max.   :143.30   Max.   :75.00  
 (Other)     :45                                                  
   fertility       population             gdp               continent 
 Min.   :1.990   Min.   :    81154   Min.   :2.019e+08   Africa  :51  
 1st Qu.:4.150   1st Qu.:  2304687   1st Qu.:1.274e+09   Americas: 0  
 Median :5.550   Median :  8799165   Median :3.238e+09   Asia    : 0  
 Mean   :5.156   Mean   : 15659800   Mean   :1.155e+10   Europe  : 0  
 3rd Qu.:5.960   3rd Qu.: 17391242   3rd Qu.:8.654e+09   Oceania : 0  
 Max.   :7.730   Max.   :122876723   Max.   :1.329e+11                
                                                                      
                       region  
 Eastern Africa           :16  
 Western Africa           :16  
 Middle Africa            : 8  
 Northern Africa          : 6  
 Southern Africa          : 5  
 Australia and New Zealand: 0  
 (Other)                  : 0  

This worked, as you can see all the years appear as “2000”.

More Plotting

I will make the same plots as before, but this time with the “africadata_2000” data.

I will create a plot of life_expectancy as a function of infant_mortality.

ggplot(africadata_2000, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point(color = "#f576c1") + # I will specify the points to be pink 
  labs(title = "Life Expectancy as a Function of Infant Mortality \n in African Countries in Year 2000",
       x = "Infant Mortality (per 1,000)",
       y = "Life Expectancy (years)") + # I will specify the title and the names of the x and y axes 
  theme(legend.position = "none", plot.title = element_text(size = 18, face = "bold", hjust = 0.5), axis.title.x = element_text(size = 12, face = "bold"), axis.title.y = element_text(size = 12, face = "bold")) # I will remove a legend (since it is unneeded here) and make the title and axis labels larger and bold 

You can see that the plot above has fewer data points that the previous plot because only data from the year 2000 for each country is included.

I will now create a plot of life_expectancy as a function of population (size). The population (x-axis) will be set to a log scale.

ggplot(africadata_2000, aes(x = population, y = life_expectancy)) +
  geom_point(color = "#f7af32") + # I will specify the points to be orange  
  scale_x_log10() + # I will set the x-axis to a log scale
  labs(title = "Life Expectancy as a Function of Population Size \n (Log Scale) in African Countries in Year 2000",
       x = "Population (Log Scale)",
       y = "Life Expectancy (years)") + # I will specify the title and the names of the x and y axes
  theme(legend.position = "none", plot.title = element_text(size = 18, face = "bold", hjust = 0.5), axis.title.x = element_text(size = 12, face = "bold"), axis.title.y = element_text(size = 12, face = "bold")) # I will remove a legend (since it is unneeded here) and make the title and axis labels larger and bold

The plot above also has fewer data points than the previous plot because it contain contains a data point from the year 2000 for each country.

Simple Model Fits

I will now use the lm() function to fit a linear model with life_expectancy as the outcome and infant_mortality as the predictor. Then, I will apply the summary() function.

fit1 <- lm(life_expectancy ~ infant_mortality, data = africadata_2000)
summary(fit1)

Call:
lm(formula = life_expectancy ~ infant_mortality, data = africadata_2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08

The intercept (a) is 71.29331; this means that the average life expectancy is 71.29331 years when infant mortality is 0. The slope (b) is -0.18916; this means that for every 1 unit increase in infant mortality, life expectancy decreases by 0.18916 years.

I will now use the lm() function to fit a linear model with life_expectancy as the outcome and population size as the predictor. Then, I will apply the summary() function.

fit2 <- lm(life_expectancy ~ population, data = africadata_2000)
summary(fit2)

Call:
lm(formula = life_expectancy ~ population, data = africadata_2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.429  -4.602  -2.568   3.800  18.802 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
population  2.756e-08  5.459e-08   0.505    0.616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159

The intercept (a) is 5.593e+01; this means that the average life expectancy is 55.93 years when the population size is equal to 0. The slope (b) is 2.756e-08; this means that for every 1 unit increase in population size, life expectancy increases by 2.756e-08 years. This intrepretation does not make sense - because if the population size is 0, the average life expectancy should be non-existent.

The p-value for “fit1” is 2.826e-08, while the p-value for “fit2” is 0.6159. 2.825e-08 is less than 0.05, therefore, the null hypothesis that there is no relationship between infant mortality and life expectancy can be rejected. 0.6159 is greater than 0.05, the null hypothesis that there is no relationship between population size and life expectancy cannot be rejected (we fail to reject). With this information, it can be said that infant mortality is a better predictor of life expectancy than population size. Therefore, “fit1” should be used.

Part 2

This section is contributed by Yufei Wu.

Dataset exploration

I choose the dataset “mice_weights” which contains body weights, bone density, and percent fat for mice under two diets: chow and high fat.

# load package
library(dslabs)
#look at help file for mice_weights data
help(mice_weights)
#get an overview of data structure
str(mice_weights)
'data.frame':   780 obs. of  7 variables:
 $ body_weight : num  27.6 23 28.7 32.6 28.6 ...
 $ bone_density: num  0.616 0.769 0.684 0.644 0.53 ...
 $ percent_fat : num  7.26 4.95 6.02 9.54 6.99 ...
 $ sex         : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ diet        : Factor w/ 2 levels "chow","hf": 1 1 1 1 1 1 1 1 1 1 ...
 $ gen         : Factor w/ 5 levels "4","7","8","9",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ litter      : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#get a summary of data
summary(mice_weights)
  body_weight     bone_density     percent_fat     sex       diet     gen     
 Min.   :18.13   Min.   :0.2708   Min.   : 2.552   F:398   chow:394   4 : 97  
 1st Qu.:28.09   1st Qu.:0.4888   1st Qu.: 5.566   M:382   hf  :386   7 :195  
 Median :32.98   Median :0.5643   Median : 8.276                      8 :193  
 Mean   :34.08   Mean   :0.5697   Mean   : 8.594                      9 : 97  
 3rd Qu.:39.37   3rd Qu.:0.6373   3rd Qu.:10.926                      11:198  
 Max.   :65.15   Max.   :0.9980   Max.   :22.154                              
                 NA's   :4        NA's   :4                                   
 litter 
 1:442  
 2:338  
        
        
        
        
        
#determine the type of object mice_weights is
class(mice_weights)
[1] "data.frame"
#get the overview of the data
head(mice_weights)
  body_weight bone_density percent_fat sex diet gen litter
1       27.60    0.6163850    7.255468   F chow   4      1
2       23.03    0.7693496    4.951037   F chow   4      1
3       28.72    0.6842564    6.020849   F chow   4      1
4       32.57    0.6436947    9.536251   F chow   4      1
5       28.61    0.5297713    6.987331   F chow   4      1
6       28.16    0.5649217    6.767774   F chow   4      1

Data processing

Creat a new data frame containing only famale rats.

#assign only the female rat to a new object called femalerat
femalerat <- mice_weights[mice_weights$sex == "F", ]

#check the structure
str(femalerat)
'data.frame':   398 obs. of  7 variables:
 $ body_weight : num  27.6 23 28.7 32.6 28.6 ...
 $ bone_density: num  0.616 0.769 0.684 0.644 0.53 ...
 $ percent_fat : num  7.26 4.95 6.02 9.54 6.99 ...
 $ sex         : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ diet        : Factor w/ 2 levels "chow","hf": 1 1 1 1 1 1 1 1 1 1 ...
 $ gen         : Factor w/ 5 levels "4","7","8","9",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ litter      : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#check the summary
summary(femalerat)
  body_weight     bone_density     percent_fat     sex       diet     gen     
 Min.   :18.13   Min.   :0.2708   Min.   : 2.766   F:398   chow:200   4 : 50  
 1st Qu.:25.35   1st Qu.:0.4531   1st Qu.: 5.690   M:  0   hf  :198   7 : 98  
 Median :28.71   Median :0.5090   Median : 8.181                      8 :100  
 Mean   :29.76   Mean   :0.5186   Mean   : 8.683                      9 : 50  
 3rd Qu.:33.21   3rd Qu.:0.5793   3rd Qu.:11.017                      11:100  
 Max.   :65.15   Max.   :0.8519   Max.   :22.154                              
                 NA's   :1        NA's   :1                                   
 litter 
 1:224  
 2:174  
        
        
        
        
        
#get the overview of the data
head(femalerat)
  body_weight bone_density percent_fat sex diet gen litter
1       27.60    0.6163850    7.255468   F chow   4      1
2       23.03    0.7693496    4.951037   F chow   4      1
3       28.72    0.6842564    6.020849   F chow   4      1
4       32.57    0.6436947    9.536251   F chow   4      1
5       28.61    0.5297713    6.987331   F chow   4      1
6       28.16    0.5649217    6.767774   F chow   4      1

Figures and tables

#load package
library(ggplot2)
library(dplyr)

#make plots of body weight vs body density, coloring by diet
ggplot(femalerat, aes(x = body_weight, y = bone_density, color = diet)) +
  geom_point() +
  labs(
    title = "Relationship between body density and body weight",
    x = "Body weight",
    y = "Body density",
    color = "Diet")

#make plots of body weight vs percent fat, coloring by diet
ggplot(femalerat, aes(x = body_weight, y = percent_fat, color = diet)) +
  geom_point() +
  labs(
    title = "Relationship between percent fat and body weight",
    x = "Body weight",
    y = "Percent fat",
    color = "Diet")

#summarize the mean and standard deviation of body weight grouped by diet
summary_table <- femalerat %>%
  group_by(diet) %>%
  summarise(
    mean_weight = mean(body_weight, na.rm = TRUE),
    sd_weight = sd(body_weight, na.rm = TRUE),
    mean_percent_fat = mean(percent_fat, na.rm = TRUE),
    sd_percent_fat = sd(percent_fat, na.rm = TRUE),
    count = n()
  )
print(summary_table)
# A tibble: 2 × 6
  diet  mean_weight sd_weight mean_percent_fat sd_percent_fat count
  <fct>       <dbl>     <dbl>            <dbl>          <dbl> <int>
1 chow         27.8      4.82             6.86           2.61   200
2 hf           31.7      7.06            10.5            3.75   198

From the first figure, we cannot see a clear relationship between body density and body weight. From the second figure, we can see that there is a positive relationship between percent fat and body weight. Besides, there are more points from high fat group locating in higher percent fat. From the table, we can see that the body weight and percent fat are both higher in the high fat group than chow group, which can be further explored.

Statistical model

Since I noticed that body weight and percent fat are both higher in the high fat group than chow group, I will test if the differences are statistically significant.

#run t-test for body weight comparison
t_test_weight <- t.test(body_weight ~ diet, data = femalerat)
print(t_test_weight)

    Welch Two Sample t-test

data:  body_weight by diet
t = -6.3936, df = 347.4, p-value = 5.224e-10
alternative hypothesis: true difference in means between group chow and group hf is not equal to 0
95 percent confidence interval:
 -5.073552 -2.686420
sample estimates:
mean in group chow   mean in group hf 
          27.82860           31.70859 
#run t-test for percent fat comparison
t_test_fat <- t.test(percent_fat ~ diet, data = femalerat)
print(t_test_fat)

    Welch Two Sample t-test

data:  percent_fat by diet
t = -11.316, df = 349.56, p-value < 2.2e-16
alternative hypothesis: true difference in means between group chow and group hf is not equal to 0
95 percent confidence interval:
 -4.310618 -3.034073
sample estimates:
mean in group chow   mean in group hf 
          6.860235          10.532581 

Report the results

For the body weight comparison, the difference between high fat group and chow group is 3.88, p value is less than 0.05, so the mean body weight differs significantly between the high Fat and chow groups. For the percent fat comparison, the difference between high fat group and chow group is 3.67, p value is less than 0.05, so the mean percent fat differs significantly between the high Fat and chow groups.