Intro
Hello and welcome to another edition of Data Bytes! In this edition we’ll be analyzing how different levels of snoring relate to the frequency of heart disease.
I got this data from A Handbook of Small Data Sets [1] (page 19, dataset 24).
I chose this data for a few reasons:
1. We all know a snorer (sometimes it’s even us ^_^;)
2. This is good data to look into to encourage us to find the source of our snoring and help improve our sleep (and our partners’ as well)
3. Health data is always relevant and personal even at a larger scale.
So let’s get to it.
Initial Glance
Our data comes in 5 columns by 2 rows (a label column plus one count column per snoring level) and is already in a nice contingency table for us.
The data isn’t tidy, since each row is an aggregate of many observations, but hey, we can’t all be neat and tidy all the time.
At first glance, among people with no heart disease the non-snorers are the majority: 1335 non-snorers to 1019 snorers of any frequency.
Among those who do have heart disease, only 24 do not snore, compared to 86 who do.
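Raw counts can hide how different the groups are in size, so it helps to turn them into rates. Here is a minimal sketch that rebuilds the table by hand (the row labels `heart_disease` and `no_heart_disease` are made up for illustration) and computes the share of each snoring group that has heart disease:

```python
import pandas as pd

# Counts as they appear in this article's contingency table.
counts = pd.DataFrame(
    {
        "no_snore": [24, 1335],
        "occasional": [35, 603],
        "frequent": [21, 192],
        "every_night": [30, 224],
    },
    index=["heart_disease", "no_heart_disease"],  # illustrative row labels
)

# Percentage of each snoring group that has heart disease.
rate = (counts.loc["heart_disease"] / counts.sum(axis=0) * 100).round(1)
print(rate)
```

With these counts the rate climbs from about 1.8% for non-snorers to about 11.8% for every-night snorers.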
Summary
| | no_snore | occasional | frequent | every_night |
| Count | 2 | 2 | 2 | 2 |
| Mean | 680 | 319 | 107 | 127 |
| STD | 927 | 401 | 120 | 137 |
| Min | 24 | 35 | 21 | 30 |
| Max | 1335 | 603 | 192 | 224 |
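The table above looks like what pandas `describe()` returns, though that is an assumption, since the article doesn't say how it was produced. A rough sketch of how it could be reproduced (the row labels are made up for illustration):

```python
import pandas as pd

# Counts as they appear in this article's contingency table.
counts = pd.DataFrame(
    {
        "no_snore": [24, 1335],
        "occasional": [35, 603],
        "frequent": [21, 192],
        "every_night": [30, 224],
    },
    index=["heart_disease", "no_heart_disease"],  # illustrative row labels
)

# describe() also reports quartiles; keep only the rows shown in the table.
summary = counts.describe().loc[["count", "mean", "std", "min", "max"]]
print(summary)
```

Note that `std` here is the sample standard deviation, and with only two values per column it is simply the gap between them divided by the square root of 2, which is part of why these stats say so little.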
Hypothesis
Our summary stats aren’t very helpful in this circumstance. From our initial observation of the data we can form the hypothesis that there is a connection between snoring and heart disease.
Our null hypothesis is that the rate of heart disease is the same at every level of snoring. We will test this hypothesis using a chi-squared test.
The test gives us a p-value: the probability of seeing a table at least as lopsided as ours if snoring and heart disease were truly independent. In other words, is the difference in heart disease rates between snorers and non-snorers too large to put down to chance?
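To make the test less of a black box, here is a sketch of the arithmetic behind a chi-squared test of independence, done by hand with NumPy. The expected count for each cell is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The variable names are my own.

```python
import numpy as np
from scipy import stats

# Observed counts: row 0 = with heart disease, row 1 = without.
observed = np.array([[24, 35, 21, 30],
                     [1335, 603, 192, 224]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 4)
total = observed.sum()

# Counts we would expect if heart disease were independent of snoring.
expected = row_totals * col_totals / total

# Chi-squared statistic and its degrees of freedom: (rows - 1) * (cols - 1).
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution.
p_value = stats.chi2.sf(chi2, dof)
print(chi2, dof, p_value)
```

For a table of this size, `scipy.stats.chi2_contingency` does the same calculation in one call.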
Hypothesis Test
import pandas as pd
import scipy.stats as stats

# Load the contingency table: a "heart_disease" label column plus four count columns.
data_frame = pd.read_csv("snoringheart.csv")
# Move the label into the index so only the four count columns are tested;
# a label column left among the data would be counted as a fifth snoring category.
counts = data_frame.set_index("heart_disease")
stat, pval, dof, expected = stats.chi2_contingency(counts)
print(pval)
print(expected)

P-Value ≈ 2.1e-15
Expected (rounded to two decimals):
[
[60.67, 28.48, 9.51, 11.34] - with heart disease
[1298.33, 609.52, 203.49, 242.66] - without
]
Observed:
[
[24, 35, 21, 30] - with heart disease
[1335, 603, 192, 224] - without
]
Conclusion
Well, I believe we can confidently reject the null hypothesis with this data. Our traditional p-value threshold is 0.05, and the one we calculated is far, far less than that. If snoring and heart disease were unrelated, we would expect to see roughly 61 non-snorers and 49 snorers with heart disease. Instead, what we observed is 24 non-snorers and 86 snorers with heart disease. As an added bonus, our data and testing match up with the medical literature [2]!
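The 61-versus-49 comparison can be checked with a few lines. This is a sketch that lumps the three snoring columns into a single "snorer" group and works out the counts we would expect if snoring made no difference; the variable names are my own.

```python
import numpy as np

# Observed counts: row 0 = with heart disease, row 1 = without.
observed = np.array([[24, 35, 21, 30],
                     [1335, 603, 192, 224]])

# Collapse to two columns: non-snorers, and snorers of any frequency.
collapsed = np.column_stack([observed[:, 0], observed[:, 1:].sum(axis=1)])

row_totals = collapsed.sum(axis=1, keepdims=True)
col_totals = collapsed.sum(axis=0, keepdims=True)

# Expected counts under independence: row total * column total / grand total.
expected = row_totals * col_totals / collapsed.sum()

print("observed with heart disease:", collapsed[0])
print("expected with heart disease:", expected[0].round(0))
```

The first row comes out as 24 and 86 observed against roughly 61 and 49 expected.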
Thank you for taking the time to read this article!
References
[1] Hand, D. J., Daly, F., Lunn, A. D., et al. A Handbook of Small Data Sets.
[2] Marin, Jose M., et al. “Long-term cardiovascular outcomes in men with obstructive sleep apnoea-hypopnoea with or without treatment with continuous positive airway pressure: an observational study.” The Lancet 365.9464 (2005): 1046-1053.