STAT1013: Practical Assignment Part 1: Sharing Your Idea and Data (25 Points)

This semester, you will complete a practical assignment. You will submit this assignment in TWO parts. This handout describes the first part of the assignment, where you will share your idea and data.

For the practical assignment, the main goal is to compare averages from two population groups.

You will gather quantitative data (a response from each subject is a number!) from two samples to compare two population means. The samples will either be independent (two groups are not connected/related) or paired. After you gather your data, you will share this data, describe each sample separately using appropriate graphs and summary statistics obtained from the statistical software package of your choice, and then use statistical software to perform a two-sample t-test or a matched-pairs/paired t-test depending on the kind of data you have collected. The entire practical assignment (i.e., all three parts together) is worth a total of 100 points.
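
To make the independent versus paired distinction concrete, here is a minimal sketch of the two test statistics computed by hand with Python's standard library. The numbers are made up, the function names welch_t and paired_t are hypothetical, and the independent-samples version assumes the unequal-variance (Welch) form of the two-sample t-test. In the assignment itself you would let your statistical software do this calculation.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """t statistic for two independent samples (Welch form, an assumption here)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

def paired_t(first, second):
    """t statistic for paired samples: a one-sample t on the pairwise differences."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(first, second)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Made-up prices of the same five products at two stores (a real sample needs 30+).
store_a = [3.49, 2.99, 5.25, 1.89, 4.10]
store_b = [3.79, 3.05, 5.60, 1.99, 4.35]

print("independent-samples t:", round(welch_t(store_a, store_b), 3))
print("paired t:", round(paired_t(store_a, store_b), 3))
```

Because each product appears in both stores, the paired version is the appropriate one for this example; the independent version is shown only for contrast.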

Coming up with an Idea

Often, one of the most difficult parts of the research process is coming up with an idea! To make this assignment meaningful, try to think of something that you would be interested in knowing about. Or, think of some data that you have already gathered, or that you have easy access to. Just keep in mind that you will be gathering quantitative data (not categorical data) from two different groups.

In each of your samples, you will need at least 30 subjects. Subjects can be individuals (if you are comparing males' and females' study hours, for example, the subjects would be individual males and females, so you'd need a total of 60 different people in your data set), or subjects can be other things (for example, if you are comparing average house prices in Hong Kong and New York City, the subjects are houses, and you'd need at least 60 house prices in your data set). Subjects can even be food items. You may choose to compare the prices of the same products at two different stores. You would therefore need to gather data on at least 30 products, and you would end up with 60 prices (30 from one store and 30 from the other store). From each subject, you need to be able to gather the same kind of quantitative data (with the same unit). So, for example, if you plan to compare the prices of food items at two stores, you will gather information about the PRICE of each food item; if you plan to compare male and female college students in terms of how many hours they spend studying per week, you will gather data on this quantitative variable (HOURS studied) from EACH student. In the first example, food items are subjects, and in the second example, students are subjects. If your project is one where you are comparing school districts' average teacher salaries, your subjects would be school districts.

Note that in the above examples, the data you are collecting (price of food item, number of study hours, salary of teachers) are quantitative (numbers). If you collect categorical data, you cannot calculate an average and cannot complete this assignment.
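
As a quick illustration of this point, the sketch below checks whether a list of responses is quantitative before averaging it. The helper is_quantitative and both example lists are hypothetical and use made-up values.

```python
import statistics

def is_quantitative(values):
    """Return True when every value is a real number (booleans excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )

study_hours = [12, 8.5, 15, 10]          # hypothetical quantitative responses
favorite_color = ["red", "blue", "red"]  # hypothetical categorical responses

for name, values in [("study_hours", study_hours), ("favorite_color", favorite_color)]:
    if is_quantitative(values):
        print(name, "mean =", statistics.mean(values))
    else:
        print(name, "is categorical: there is no mean to compare")
```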

To help you brainstorm, consider some of these ideas that past students have pursued.

  • Do female college students have different grade point averages (GPA) than male college students? (two groups : male vs female, response variable : GPA)
  • Do dog treats placed on the top shelf at a pet store cost a different amount than dog treats placed on the bottom shelf? (two groups : dog treats placed on the top shelf vs bottom shelf, response variable : prices)
  • Do individuals who are 30 years of age or younger have more Facebook friends than individuals who are older than 30? (two groups : 30 or younger vs older, response variable : number of Facebook friends)
  • Do first-borns have higher GPA than non-first-borns? (two groups : first-borns vs non-first-borns, response variable : GPA)
  • Do sports teams score more goals/points when they play at home than away? (two groups : home games vs away games, response variable : number of goals/points)
  • Do magazines aimed at a male audience cost a different amount than magazines aimed at a female audience?
  • Do one-bedroom apartments in Hong Kong cost a different amount per month to rent than one-bedroom apartments in New York City?
  • Do certain food items cost a different amount at Cub Foods than they do at Byers?
  • Do individuals tend to send more e-mail on a daily basis than they receive?
  • Do running backs score more points in football games than wide receivers?
  • Do gas prices in Hong Kong differ from gas prices outside of Hong Kong?

You certainly do not have to use any of the above ideas, but we hope they will give you a good starting point. Look carefully at the ideas above in order to see just what kind of data we are hoping you will collect. As you can see in each idea, you are comparing two samples on the same response variable (and the response variable is always QUANTITATIVE/a number).

Be sure to keep in mind, as you are trying to come up with an idea, how you will be collecting your data. You want to be able to collect your data relatively quickly (since you will need to have that data by the end of Week 7 of the semester). Although gathering a random sample is preferable, the sample you obtain for your project DOES NOT need to be random since you will only have a short time to gather your data. Also, this data is only meant to be for a class assignment, so you should plan to gather data in an unobtrusive manner. This is not an assignment where you are expected to get IRB approval.

Before you begin collecting data, it is important that you first share your idea with your TA or me and get approval. When you share this information, please attempt to do so in an organized way so that it will be easy for the TAs to determine what you plan to do. For example, you should plan to submit a document in which you number or label each part (following the number scheme below). Answer the following five questions in your write-up that you will submit in BlackBoard.

  • Background and basic description of the dataset (5 points)

  • Hypothesis

    1) (2 points) Tell us what your idea is and why you have chosen to pursue this idea.

    2) (2 points) Carefully explain the following:

       1. What two groups you are comparing
       2. What you will be measuring (i.e., what your response variable will be)
       3. Whether your response variable is quantitative rather than categorical

    3) (2 points) Make a prediction about what kind of difference you expect to see between your samples and WHY. Note that when we gather data in statistics in order to compare samples, we often hypothesize that the groups will differ in some way. (You may have a good reason to believe that one group mean would be larger than the other, or vice versa. If not, you can simply hypothesize that the two means are not the same.)

    4) (2 points) Talk about how you will gather your data (e.g., will you go to certain websites to find your data, will you survey friends and classmates, will you go to different stores to gather data, will you ignore sale prices and concentrate on price per ounce if you compare food items, etc.).

    5) (2 points) If you had unlimited resources (time, money, staff, etc.), how would you collect the data described in Question 4?

Once the TAs approve your idea, you are free to begin collecting data.

Prepare your dataset

The second section of the assignment is to collect data, read the data into Python, and list the groups you want to compare.

There is a lot of data online, and if you are trying to find something in particular, try searching for it. You might type in keywords such as “data”, “csv”, or “github”, and then other keywords, depending on your interests. Just know that even though you may have an IDEAL project topic in mind that you’d like to pursue, you may not be able to find the appropriate kind of data in the time you have to work on this assignment. Thus, you should have a “Plan B” that will allow you to complete this assignment in a timely manner. The following sources may be useful for collecting data:

  • Github Raw CSV datasets
    • https://github.com/prasertcbs/basic-dataset
    • https://github.com/Opensourcefordatascience/Data-sets
  • Public datasets
    • https://github.com/awesomedata/awesome-public-datasets
    • https://data.world/datasets/csv
    • https://github.com/curran/data

You should then read your data into Python using one of the following options:

import pandas as pd

## (option 1; recommended) load csv data via github link

# Step 1: find the link of csv dataset
# Step 2: copy & paste the link into pd.read_csv(...)

# replace the placeholder string with the raw csv link you copied
df = pd.read_csv("<link_to_github_raw_data>")

## (option 2) load data by downloading the csv

# Step 1: download the dataset (*.csv)
# Step 2: upload the csv file (from your local machine) to colab
# Step 3: copy the path of the csv file
# Step 4: load the data via pd.read_csv(path_to_csv)

# replace the placeholder string with the path you copied
df = pd.read_csv("<path_to_csv>")
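
After loading, it is worth checking that the data arrived as expected. The sketch below is one possible way to do that; it reads a small made-up CSV from a string so that it runs anywhere, and the column names store, item, and price are assumptions, not part of any dataset listed above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded csv file (made-up values).
csv_text = """store,item,price
A,milk,3.49
A,bread,2.99
B,milk,3.79
B,bread,
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # the response column should be numeric (int64 or float64)
print(df.isna().sum())  # number of missing values in each column
```

If the response column shows up with dtype object instead of a numeric type, some entries are probably text and need cleaning before you can compute averages.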

Answer the following two questions in the Jupyter notebook write-up that you will submit in BlackBoard.

1) (3 points) Tell us what groups you want to compare in the dataset.
2) (7 points) Print the first 5 records of each group.
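
One possible way to split a dataset into two groups and print the first 5 records of each is sketched below. The DataFrame is made up, and the column names shelf and price are assumptions; substitute the grouping column and response column from your own data.

```python
import pandas as pd

# Hypothetical data: 6 made-up prices per group (a real sample needs 30+ each).
df = pd.DataFrame({
    "shelf": ["top", "bottom"] * 6,
    "price": [5.99, 4.49, 6.50, 3.99, 7.25, 4.75,
              5.10, 4.20, 6.80, 3.50, 5.75, 4.95],
})

# Select the rows that belong to each group.
top = df[df["shelf"] == "top"]
bottom = df[df["shelf"] == "bottom"]

print(top.head(5))     # first 5 records of group 1
print(bottom.head(5))  # first 5 records of group 2

# Group sizes, to confirm each sample is large enough.
print(df["shelf"].value_counts())
```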