Data Bytes: Snoring and Heart Disease

Intro


Hello and welcome to another edition of Data Bytes! In this edition we’ll be analyzing how different levels of snoring relate to the frequency of heart disease.
I got this data from A Handbook of Small Data Sets [1] (page 19, dataset 24).
I chose this data for a couple of reasons:
1. We all know a snorer (sometimes it’s even us ^_^;)
2. This is good data to look into to encourage us to find the source of our snoring and help improve our sleep (and our partners’ as well)
3. Health data is always relevant and personal, even at a larger scale.
So let’s get to it.

Initial Glance

Our data comes as 2 rows by 5 columns (a heart disease label plus four snoring levels) and is already in a nice contingency table for us.
The data isn’t tidy, since each cell is an aggregate count of observations rather than a single observation, but hey, we can’t all be neat and tidy all the time.
At first glance, most people without heart disease are non-snorers: 1335 non-snorers to 1019 snorers of any frequency.
Among those who do have heart disease, only 24 do not snore, compared to 86 who do.
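The totals quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the counts from the Observed table later in this post; the variable names are made up for illustration. It also prints the share of each snoring level that has heart disease, which is the comparison the rest of the post is really about.

```python
# Counts copied from the Observed table in this post.
levels = ["no_snore", "occasional", "frequent", "every_night"]
with_disease = [24, 35, 21, 30]
without_disease = [1335, 603, 192, 224]

# Totals for snorers of any frequency (every level except no_snore).
snorers_with = sum(with_disease[1:])
snorers_without = sum(without_disease[1:])

# Share of each snoring level that has heart disease.
rates = {
    level: sick / (sick + healthy)
    for level, sick, healthy in zip(levels, with_disease, without_disease)
}

print(snorers_with, snorers_without)
for level, rate in rates.items():
    print(f"{level}: {rate:.1%}")
```

Looking at rates rather than raw counts matters here because the groups are very different sizes.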

Summary

        no_snore  occasional  frequent  every_night
Count          2           2         2            2
Mean         680         319       107          127
STD          927         401       120          137
Min           24          35        21           30
Max         1335         603       192          224

Hypothesis

Our summary stats aren’t very helpful in this circumstance, since each column holds only two counts. From our initial observation of the data we can hypothesize that there is a connection between snoring and heart disease.
Our null hypothesis is that the rate of heart disease is the same at every level of snoring. We will test this hypothesis using a chi-squared test of independence.
This will tell us how likely we would be to see differences at least as large as the ones we observed if snoring and heart disease were independent. In other words, is the difference in heart disease rates between snorers and non-snorers small enough to be explained by chance?
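To make the test less of a black box, here is a rough sketch of what a chi-squared test of independence does under the hood, written with only the standard library. It assumes the counts from the Observed table further down. The helper chi2_upper_tail_3dof is a hypothetical name, and its closed-form formula only holds for exactly 3 degrees of freedom, which is what a 2x4 table has.

```python
import math

# Observed counts, copied from the Observed table in this post.
observed = [
    [24, 35, 21, 30],       # with heart disease
    [1335, 603, 192, 224],  # without heart disease
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1), which is 3 for a 2x4 table.
dof = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_upper_tail_3dof(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_upper_tail_3dof(chi2)
print(expected)
print(chi2, dof, p_value)
```

The bigger the gaps between observed and expected counts, the bigger the statistic and the smaller the p-value.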

Hypothesis Test

import pandas as pd
import scipy.stats as stats

# Load the contingency table: one row per heart disease status, one column per snoring level
data_frame = pd.read_csv("snoringheart.csv")

# Recode the label as a boolean and move it into the index so that only the
# four count columns are passed to the test
data_frame["heart_status"] = data_frame["heart_disease"] != "no"
data_frame = data_frame.drop(columns=["heart_disease"]).set_index("heart_status")

stat, pval, dof, expected = stats.chi2_contingency(data_frame)
print(pval)
print(expected)

P-Value ≈ 2.1e-15
Expected (rounded to two decimals):
[
[60.67 28.48 9.51 11.34] - with heart disease
[1298.33 609.52 203.49 242.66] - without
]

Observed:
[
[24,35,21,30] - with heart disease
[1335,603,192,224] - without
]

Conclusion

Well, I believe we can confidently reject the null hypothesis with this data. The traditional p-value threshold is 0.05, and the one we calculated is far, far smaller than that. If snoring and heart disease were independent, we would expect to see about 61 non-snorers and 49 snorers with heart disease. Instead we observed 24 non-snorers and 86 snorers with heart disease. As an added bonus, our data and testing match up with the medical literature [2]!

Thank you for taking the time to read this article!

References

[1] D. J. Hand, F. Daly, A. D. Lunn, et al. A Handbook of Small Data Sets.
[2] Marin, Jose M., et al. “Long-term cardiovascular outcomes in men with obstructive sleep apnoea-hypopnoea with or without treatment with continuous positive airway pressure: an observational study.” The Lancet 365.9464 (2005): 1046-1053.

Data Byte: Fatness and ‘Sex’

Intro

I copied this data from A Handbook of Small Data Sets [1] (page 13, data set 17). Originally it was called Human age and fatness. According to the book, the data came from a study done in the 1980s in which researchers were investigating a new method of measuring body fat percentage. They recorded 18 data points detailing age, sex, and fat percentage.

Initial Glance

The data has 18 rows and 3 columns. The data is also tidy, with each row being a separate observation and each column being a separate variable. This is going to make our lives much easier since we won’t have to reshape the data. Also, there is no missing data and every column has values of the type we would expect: ages are all integers, fat percentages are all floats, and ‘sex’ values are all strings.

There are 4 ‘Male’ rows and 14 ‘Female’ rows. While we certainly couldn’t draw any major conclusions from our samples, especially the ‘Male’ data, there is enough here for us to run calculations on.

Our variables are:

  • Age
  • ‘Sex’
  • Fat Percentage

Summary Statistics

‘Male’

Sample size: 4

Mean: 15.6

Standard deviation: 7.78

Min: 7.8

Max: 27.4

‘Female’

Sample size: 14

Mean: 32.3

Standard deviation: 4.72

Min: 25.2

Max: 42.0
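These figures can be recomputed from the raw values listed in the Hypothesis Test section below. This is a minimal sketch using only the standard library. The summarize helper is a hypothetical name, and it uses the population standard deviation (dividing by n), which is the convention that reproduces the 4.72 reported for ‘Female’.

```python
from statistics import mean, pstdev

# Fat percentages as listed in the hypothesis test code of this post.
male = [9.5, 7.8, 17.8, 27.4]
female = [27.9, 31.4, 25.9, 25.2, 31.1, 34.7, 42.0,
          29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]


def summarize(values):
    """Return sample size, mean, population standard deviation, min and max."""
    return {
        "n": len(values),
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


for label, values in [("Male", male), ("Female", female)]:
    stats = summarize(values)
    print(f"{label}: n={stats['n']} mean={stats['mean']:.3f} "
          f"std={stats['std']:.3f} min={stats['min']} max={stats['max']}")
```

Swapping pstdev for stdev would give the sample standard deviation (dividing by n - 1), which comes out a little larger for groups this small.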

Hypothesis

According to Active.com [2], the acceptable range of body fat percentage for ‘Male’ is 18-25%. For ‘Female’ it is 25-31%.
My hypothesis is that the means for our two groups fall within the ‘Acceptable’ range. To make calculations easy we will take the midpoint of each range. This gives us 21.5 for ‘Male’ and 28 for ‘Female’.


Null Hypothesis: Each group’s mean body fat percentage equals the midpoint of its acceptable range.
Alt Hypothesis: Each group’s mean body fat percentage differs from the midpoint of its acceptable range.

Hypothesis Test

To test our hypothesis we will use Python 3 and ttest_1samp from the SciPy package. Very briefly, ttest_1samp is the one-sample t-test: it checks whether the mean of a single group of data differs from a given value. We are operating at a 95% confidence level, so in order for us to reject the null hypothesis our p-value must be less than 0.05.

from scipy.stats import ttest_1samp

# Body fat percentages for each group, copied from the data set
Male_data = [9.5,7.8,17.8,27.4]
Female_data = [27.9,31.4,25.9,25.2,31.1,34.7,42.0,29.1,32.5,30.3,33.0,33.8,41.1,34.5]

# Compare each group's mean against the midpoint of its acceptable range
print(ttest_1samp(Male_data, 21.5))

>>Ttest_1sampResult(statistic=-1.3079057076248999, pvalue=0.28210004133913985)

print(ttest_1samp(Female_data, 28))

>>Ttest_1sampResult(statistic=3.2998921731172364, pvalue=0.005748911939385525)
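The statistic values reported above can be reproduced by hand, which shows what the test is actually measuring: how many standard errors the sample mean sits away from the target value. This is a minimal sketch using only the standard library; one_sample_t is a hypothetical helper, not part of SciPy.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """t statistic for testing whether the mean of values equals mu0."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)  # stdev divides by n - 1
    return (mean(values) - mu0) / standard_error


male = [9.5, 7.8, 17.8, 27.4]
female = [27.9, 31.4, 25.9, 25.2, 31.1, 34.7, 42.0,
          29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

t_male = one_sample_t(male, 21.5)
t_female = one_sample_t(female, 28)
print(t_male, t_female)
```

With only 4 ‘Male’ values the standard error is large, so even a sizeable gap between the mean and 21.5 yields a small t statistic.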

Conclusions

The data we had for the ‘Male’ category was insufficient to reject the null hypothesis at a 95% confidence level. We can tell because the p-value is high, i.e. it’s over 0.05.
However, the data we had for the ‘Female’ category was sufficient to reject our null hypothesis. In this case the p-value is well below 0.05.

My guess is that the ‘Female’ category is more representative of ‘obese’ fat percentages. That is, fat percentages at or above 32%.

print(ttest_1samp(Female_data, 32))

>>Ttest_1sampResult(statistic=0.24544652527318647, pvalue=0.8099429524799272)

Since the p-value is much higher than 0.05, we cannot reject a mean of 32%, so my second hypothesis about the ‘Female’ data is consistent with the data we currently have.
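Another way to look at all three tests at once is a 95% confidence interval for each group’s mean: any target value inside the interval would not be rejected at the 0.05 level. This is a rough sketch, not part of the original analysis. The critical values in T_CRITICAL are assumed from a standard Student’s t table (two-sided 95%, for 3 and 13 degrees of freedom) rather than computed, and confidence_interval_95 is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Assumed from a standard t table instead of being computed with SciPy.
T_CRITICAL = {3: 3.182, 13: 2.160}


def confidence_interval_95(values):
    """95% confidence interval for the mean, for sample sizes covered by T_CRITICAL."""
    n = len(values)
    margin = T_CRITICAL[n - 1] * stdev(values) / sqrt(n)
    center = mean(values)
    return center - margin, center + margin


male = [9.5, 7.8, 17.8, 27.4]
female = [27.9, 31.4, 25.9, 25.2, 31.1, 34.7, 42.0,
          29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]

male_low, male_high = confidence_interval_95(male)
female_low, female_high = confidence_interval_95(female)
print(f"Male: {male_low:.2f} to {male_high:.2f}")
print(f"Female: {female_low:.2f} to {female_high:.2f}")
```

The ‘Male’ interval is very wide because there are only 4 values, which is why that test could not rule out 21.5.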

References

[1] https://books.google.com/books/about/A_Handbook_of_Small_Data_Sets.html?id=vWu-MJM_obsC

[2] https://www.active.com/fitness/calculators/bodyfat