Open In Colab

Working with Data

Describing Data using Pandas

In this first section, we will use the Pandas package to explore and describe data from O’Connell, K., Berluti, K., Rhoads, S. A., & Marsh, A. A. (2021). Reduced social distancing during the COVID-19 pandemic is associated with antisocial behaviors in an online United States sample. PLoS ONE.

This study assessed whether social distancing behaviors (early in the COVID-19 pandemic) was associated with self-reported antisocial behavior.

# Remember: Python requires you to explictly "import" libraries before their functions are available to use. We will always specify our imports at the beginning of each notebook.
import pandas as pd, numpy as np

Here, we will load a dataset as a pandas.DataFrame, and investigate its attributes. We will see N rows for each subject, and M columns for each variable.

Loading data

# here, we are just going to download data from the web
url = 'https://raw.githubusercontent.com/shawnrhoads/gu-psyc-347/master/docs/static/data/OConnell_COVID_MTurk_noPII_post_peerreview.csv'

# load data specified in `filename` into dataframe `df`
df = pd.read_csv(url)
# check type of df
type(df)
pandas.core.frame.DataFrame
# how many rows and columns are in df?
print(df.values.shape) # N x M
(131, 126)

Looks like we will have 131 rows (usually subjects, but can be multiple observations per subject) and 126 columns (usually variables)

# let's output the first 5 rows of the df
print(df.head())
   subID  mturk_randID  suspect_itaysisso        Country Region  \
0   1001          8797                  0  United States     CT   
1   1002          3756                  0  United States     IL   
2   1003          3798                  0  United States     OH   
3   1004          2965                  0  United States     TX   
4   1005          5953                  0  United States     NC   

                                        ISP loc_US       loc_state  \
0  AS7015 Comcast Cable Communications, LLC    Yes     Connecticut   
1                AS7018 AT&T Services, Inc.    Yes      California   
2        AS10796 Charter Communications Inc    Yes            Ohio   
3                AS7018 AT&T Services, Inc.    Yes           Texas   
4            AS20115 Charter Communications    Yes  North Carolina   

   loc_zipcode          loc_County  ...  education_4yr  STAB_total_centered  \
0         6511    New Haven County  ...              0            -3.946565   
1        90280  Los Angeles County  ...              0            39.053436   
2        44883       Seneca County  ...              0            40.053436   
3        77019       Harris County  ...              1            -9.946565   
4        28334      Sampson County  ...              0           -17.946566   

   STAB_total_min32  silhouette_dist_X_min81 silhouette_dist_X_inches  \
0                19                    441.0               110.332750   
1                62                    287.0                71.803856   
2                63                    313.0                78.308731   
3                13                    452.0               113.084820   
4                 5                    297.0                74.305733   

  violated_distancing STAB_rulebreak_rmECONOMIC  STAB_total_rmECONOMIC  \
0                   0                         9                     48   
1                   1                        24                     88   
2                   0                        23                     85   
3                   0                         8                     42   
4                   0                         8                     34   

   STAB_total_rmECONOMIC_centered  household_income_coded_centered  
0                       -2.076336                         1.269231  
1                       37.923664                        -3.730769  
2                       34.923664                        -2.730769  
3                       -8.076336                              NaN  
4                      -16.076336                        -2.730769  

[5 rows x 126 columns]

Creating custom DataFrames

We can also create our own dataframe. For example, here’s a dataframe containing 20 rows and 3 columns of random numbers.

sim_df = pd.DataFrame(np.random.randn(20, 3), index=range(0,20), columns=["column A", "column B", "column C"])
print(sim_df)
    column A  column B  column C
0  -1.523546  1.244712  1.281714
1   0.142749  0.326279 -1.028342
2  -0.356448  1.236416 -0.024497
3  -0.390965 -1.167132  1.169662
4   0.254325 -0.379238 -0.476311
5   0.717509 -0.204266 -0.484025
6  -0.637719 -0.336181  1.184558
7  -0.178774  1.829720 -1.561026
8  -0.400764  0.091366  1.227388
9  -1.894427 -1.312888  1.123745
10  1.013734  2.124048 -2.320404
11  1.797838  0.686147 -0.904062
12  0.041463 -0.680991  0.149429
13 -0.620775 -0.453663 -0.519136
14 -0.964090 -0.754053  0.260249
15  1.204661 -0.372964  0.257379
16 -0.102147 -1.428105  0.160668
17  1.527854 -1.262324  1.034834
18  1.003885 -1.704753 -0.415267
19  0.017227  0.114630 -0.494371

We can change the column names using list comprehension

# e.g., change to upper case
sim_df.columns = [x.upper() for x in sim_df.columns]
print(sim_df.head()) #display first 5 rows
   COLUMN A  COLUMN B  COLUMN C
0 -1.523546  1.244712  1.281714
1  0.142749  0.326279 -1.028342
2 -0.356448  1.236416 -0.024497
3 -0.390965 -1.167132  1.169662
4  0.254325 -0.379238 -0.476311
# e.g., change to last element in string
sim_df.columns = [x[-1] for x in sim_df.columns]
print(sim_df.tail()) #display last 5 rows
           A         B         C
15  1.204661 -0.372964  0.257379
16 -0.102147 -1.428105  0.160668
17  1.527854 -1.262324  1.034834
18  1.003885 -1.704753 -0.415267
19  0.017227  0.114630 -0.494371

Concatenating DataFrames

We can concatenate multiple dataframes containing the same columns (e.g., [‘A’,’B’,’C’]) using pd.concat(). This will stack rows across dataframes.

Usage:

pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)
np.zeros((3,3))
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
# create two new dataframes 

# the first will contain only zeros 
sim_df1 = pd.DataFrame(np.zeros((3, 3)), index=[20,21,22], columns=["A", "B", "C"])

# the second will contain only ones
sim_df2 = pd.DataFrame(np.ones((3, 3)), index=[23,24,25], columns=["A", "B", "C"])
sim_dfs = [sim_df, sim_df1, sim_df2] # as list of dfs
result = pd.concat(sim_dfs)
print(result)
           A         B         C
0  -1.523546  1.244712  1.281714
1   0.142749  0.326279 -1.028342
2  -0.356448  1.236416 -0.024497
3  -0.390965 -1.167132  1.169662
4   0.254325 -0.379238 -0.476311
5   0.717509 -0.204266 -0.484025
6  -0.637719 -0.336181  1.184558
7  -0.178774  1.829720 -1.561026
8  -0.400764  0.091366  1.227388
9  -1.894427 -1.312888  1.123745
10  1.013734  2.124048 -2.320404
11  1.797838  0.686147 -0.904062
12  0.041463 -0.680991  0.149429
13 -0.620775 -0.453663 -0.519136
14 -0.964090 -0.754053  0.260249
15  1.204661 -0.372964  0.257379
16 -0.102147 -1.428105  0.160668
17  1.527854 -1.262324  1.034834
18  1.003885 -1.704753 -0.415267
19  0.017227  0.114630 -0.494371
20  0.000000  0.000000  0.000000
21  0.000000  0.000000  0.000000
22  0.000000  0.000000  0.000000
23  1.000000  1.000000  1.000000
24  1.000000  1.000000  1.000000
25  1.000000  1.000000  1.000000

We can also concatenate using the rows (setting axis=1)

# the first will contain only zeros 
sim_df3 = pd.DataFrame(np.zeros((3, 3)), index=[1,2,3], columns=["A", "B", "C"])

# the second will contain only ones
sim_df4 = pd.DataFrame(np.ones((3, 3)), index=[2,4,5], columns=["D", "E", "F"])

result2 = pd.concat([sim_df3, sim_df4], axis=1)
print(result2)
     A    B    C    D    E    F
1  0.0  0.0  0.0  NaN  NaN  NaN
2  0.0  0.0  0.0  1.0  1.0  1.0
3  0.0  0.0  0.0  NaN  NaN  NaN
4  NaN  NaN  NaN  1.0  1.0  1.0
5  NaN  NaN  NaN  1.0  1.0  1.0

Notice that we have NaNs (not a number) in cells where there were no data (for example, no data in column D for index 1)

We have to be careful because all elements will not merge across rows and columns by default. For example, if the second df also had a column “C”, we will have two “C” columns by default.

# the first will contain only zeros 
sim_df5 = pd.DataFrame(np.zeros((3, 3)), index=[1,2,3], columns=["A", "B", "C"])

# the second will contain only ones
sim_df6 = pd.DataFrame(np.ones((3, 4)), index=[2,4,5], columns=["C", "D", "E", "F"])

result3a = pd.concat([sim_df5, sim_df6], axis=1)
print(result3a)
     A    B    C    C    D    E    F
1  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  1.0  1.0  1.0  1.0
5  NaN  NaN  NaN  1.0  1.0  1.0  1.0

By default, this method takes the “union” of dataframes. This is useful because it means no information will be lost!

But, now we have two columns named “C”. To fix this, we can have pandas rename columns with matching names using the DataFrame.merge() method.

result3b = sim_df5.merge(sim_df6, how='outer', left_index=True, right_index=True)
print(result3b)
     A    B  C_x  C_y    D    E    F
1  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  1.0  1.0  1.0  1.0
5  NaN  NaN  NaN  1.0  1.0  1.0  1.0

We could also take the “intersection” across the two dataframes. We can do that by setting join='inner' (Meaning we only keep the rows that are shared between the two). In the previous case, this would be row ‘2’. All columns would be retained.

result3c = pd.concat([sim_df5, sim_df6], axis=1, join='inner')
print(result3c)
     A    B    C    C    D    E    F
2  0.0  0.0  0.0  1.0  1.0  1.0  1.0

Manipulating DataFrames

We can also change values within the dataframe using list comprehension.

# first let's view the column "subID"
df['subID']
0      1001
1      1002
2      1003
3      1004
4      1005
       ... 
126    1160
127    1161
128    1162
129    1163
130    1164
Name: subID, Length: 131, dtype: int64
# now, let's change these value by adding a prefix 'sub_' and store in a new column called "subID_2"
df['subID_2'] = ['sub_'+str(x) for x in df['subID']]
df['subID_2']
0      sub_1001
1      sub_1002
2      sub_1003
3      sub_1004
4      sub_1005
         ...   
126    sub_1160
127    sub_1161
128    sub_1162
129    sub_1163
130    sub_1164
Name: subID_2, Length: 131, dtype: object

We can also grab specific elements in the dataframe by specifying rows and columns

print(df['age'][df['subID']==1001])
0    21
Name: age, dtype: int64
# if you know the index (row name), then you can use the `pd.DataFrame.loc` method
df.loc[0,'age']
21

Creating new columns is particularly useful for computing new variables from old variables. For example: for each subject, let’s multiply age by STAB_total.

for index,subject in enumerate(df['subID']):
    df.loc[index,'new_col'] = df.loc[index,'age'] * df.loc[index,'STAB_total'] 

We can extract a column of observations to a numpy array

sub_ids = df['subID'].values
print(sub_ids)
[1001 1002 1003 1004 1005 1007 1011 1013 1014 1015 1016 1017 1021 1022
 1023 1024 1026 1028 1029 1032 1033 1034 1035 1040 1043 1044 1045 1047
 1048 1049 1051 1053 1054 1056 1057 1058 1059 1060 1061 1062 1063 1064
 1066 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080
 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095
 1096 1098 1099 1100 1101 1102 1104 1105 1107 1108 1109 1110 1111 1112
 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126
 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1141
 1144 1145 1146 1147 1148 1149 1150 1151 1152 1154 1155 1156 1157 1158
 1160 1161 1162 1163 1164]
print(type(sub_ids))
<class 'numpy.ndarray'>

We can also transpose the dataframe

print(df.T.head())
                             0              1              2              3    \
subID                       1001           1002           1003           1004   
mturk_randID                8797           3756           3798           2965   
suspect_itaysisso              0              0              0              0   
Country            United States  United States  United States  United States   
Region                        CT             IL             OH             TX   

                             4              5              6              7    \
subID                       1005           1007           1011           1013   
mturk_randID                5953           8133           2500           7655   
suspect_itaysisso              0              0              0              0   
Country            United States  United States  United States  United States   
Region                        NC             MA             FL             WI   

                             8              9    ...            121  \
subID                       1014           1015  ...           1154   
mturk_randID                4616           2318  ...           1123   
suspect_itaysisso              0              0  ...              0   
Country            United States  United States  ...  United States   
Region                        MD             LA  ...             NC   

                             122            123            124            125  \
subID                       1155           1156           1157           1158   
mturk_randID                2455           7875           6828           3775   
suspect_itaysisso              0              0              0              0   
Country            United States  United States  United States  United States   
Region                        ID             CA             AZ             CO   

                             126            127            128            129  \
subID                       1160           1161           1162           1163   
mturk_randID                1459           7557           6617           5071   
suspect_itaysisso              0              0              0              0   
Country            United States  United States  United States  United States   
Region                        IL             WA             CA             KY   

                             130  
subID                       1164  
mturk_randID                7031  
suspect_itaysisso              0  
Country            United States  
Region                        MO  

[5 rows x 131 columns]

Statistics with DataFrames

We can compute all sorts of descriptive statistics on DataFrame columns using the following methods:

  • count(): Number of non-null observations

  • sum(): Sum of values

  • mean(): Mean of values

  • median(): Median of values

  • mode(): Mode of values

  • std(): Standard deviation of values

  • min(): Minimum value

  • max(): Maximum value

  • abs(): Absolute value

  • prod(): Product of values

  • cumsum(): Cumulative sum

  • cumprod(): Cumulative product

Here are some examples:

# mean of a column
df["age"].mean()
36.33587786259542
# mean of multiple columns
df[["age","STAB_total"]].mean()
age           36.335878
STAB_total    54.946565
dtype: float64
# median of a column
df["age"].median()
34.0
# compute a summary of metrics on columns
df[["age", "STAB_total"]].describe()
age STAB_total
count 131.000000 131.000000
mean 36.335878 54.946565
std 10.075569 30.095569
min 21.000000 32.000000
25% 29.500000 35.000000
50% 34.000000 42.000000
75% 41.000000 55.500000
max 65.000000 139.000000
# group means by sex
df.groupby("sex")[["age", "STAB_total"]].mean()
age STAB_total
sex
Female 38.264151 49.792453
Male 35.025641 58.448718
# group means by sex and education
df.groupby(["sex","education_coded"])[["age", "STAB_total"]].mean()
age STAB_total
sex education_coded
Female Four-year college degree (B.A., B.S.) 37.592593 54.074074
High school graduate or G.E.D. 42.400000 37.600000
Master's degree (for example, M.B.A., M.A., M.S.) 38.200000 55.800000
Some college 38.875000 48.875000
Two-year college degree (A.A., A.S.) 37.375000 40.125000
Male Doctoral or professional degree (for example, Ph.D., M.D., J.D., D.V.M., D.D.M) 31.500000 34.500000
Four-year college degree (B.A., B.S.) 35.125000 64.000000
High school graduate or G.E.D. 31.777778 63.888889
Master's degree (for example, M.B.A., M.A., M.S.) 39.000000 77.375000
Some college 34.368421 48.157895
Two-year college degree (A.A., A.S.) 36.750000 41.625000
# group counts by sex and education
df.groupby(["sex","education_coded"])[["age", "STAB_total"]].count()
age STAB_total
sex education_coded
Female Four-year college degree (B.A., B.S.) 27 27
High school graduate or G.E.D. 5 5
Master's degree (for example, M.B.A., M.A., M.S.) 5 5
Some college 8 8
Two-year college degree (A.A., A.S.) 8 8
Male Doctoral or professional degree (for example, Ph.D., M.D., J.D., D.V.M., D.D.M) 2 2
Four-year college degree (B.A., B.S.) 32 32
High school graduate or G.E.D. 9 9
Master's degree (for example, M.B.A., M.A., M.S.) 8 8
Some college 19 19
Two-year college degree (A.A., A.S.) 8 8

We can also correlate 2 or more variables

df[["age","STAB_total","socialdistancing"]].corr(method="spearman")
age STAB_total socialdistancing
age 1.000000 -0.145465 0.007075
STAB_total -0.145465 1.000000 -0.335639
socialdistancing 0.007075 -0.335639 1.000000

Visualizing Data using Matplotlib

To understand what our data look like, we will visualize it in different ways.

import matplotlib.pyplot as plt
%matplotlib inline

Let’s plot the distribiton of one variable in our data

plt.hist(df['age'], bins=9)
plt.xlabel("Age")
plt.ylabel("Number of Subjects")
Text(0, 0.5, 'Number of Subjects')
_images/module-01-02_Working-with-Data_62_1.png

Let’s see what percentage of subjects have a below-average score:

mean_age = np.mean(df['age'])
frac_below_mean = (df['age'] < mean_age).mean()
print(f"{frac_below_mean:2.1%} of subjects are below the mean")
63.4% of subjects are below the mean

We can also see this by adding the average score to the histogram plot:

plt.hist(df['age'], bins=9)
plt.xlabel("Age")
plt.ylabel("Number of Subjects")

plt.axvline(mean_age, color="orange", label="Mean Age")
plt.legend()
<matplotlib.legend.Legend at 0x174e572d908>
_images/module-01-02_Working-with-Data_66_1.png

Comparing mean and median

med_age = np.median(df['age'])

plt.hist(df['age'], bins=9)
plt.xlabel("Age")
plt.ylabel("Number of Subjects")

plt.axvline(mean_age, color="orange", label="Mean Age")
plt.axvline(med_age, color="black", label="Median Age")

plt.legend()
<matplotlib.legend.Legend at 0x174e5828b08>
_images/module-01-02_Working-with-Data_68_1.png