This week in Data Without Borders, we further explored data subsetting and querying in R. Here's what we covered:

  • How to ask R for any logical subset of rows from your data
  • Adding and removing columns from data frames
  • Knowing the difference between vectors and data frames
  • Plotting various views of your data using plot() as well as other graphing functions

    R is great for subsetting: targeting specific rows or columns in a data frame and storing the result in a new variable.
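
To make the topics above concrete, here is a minimal sketch using mtcars, a data frame that ships with R. It is not from the class materials; the variable names (cars, efficient, kpl, mpg.vector, mpg.frame) are made up for illustration.

```r
# mtcars is a built-in data frame; copy it so the original stays untouched.
cars <- mtcars

# Logical subset of rows: keep only cars that get more than 25 miles per gallon.
efficient <- cars[cars$mpg > 25, ]

# Add a column (approximate kilometres per litre), then remove it again.
cars$kpl <- cars$mpg * 0.425
has.kpl <- "kpl" %in% names(cars)
cars$kpl <- NULL

# A single column pulled out with $ is a vector...
mpg.vector <- cars$mpg
# ...while single-bracket indexing by name keeps it as a one-column data frame.
mpg.frame <- cars["mpg"]

print(class(mpg.vector))
print(class(mpg.frame))
```

The comma in cars[cars$mpg > 25, ] matters: the condition goes before it to pick rows, and leaving the part after it empty keeps every column.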

    For instance, part of this week's assignment asked us to write code that returns the three crimes each race is most often suspected of in the NYC stop-and-frisk dataset. (The dataset is available here). I wrote a for loop that cycles through the races by their numerical codes:

    Each variable holds a subset. For instance, subset.race is built from the which.race variable, which picks out the rows in the race column that match the current race code.

    subset.race.crime.abbr takes the subset.race variable and pulls out its crime.abbr column (crime.abbr is a column filled with three-letter abbreviations of the suspected crime for each stop).

    As the for loop cycles through race.code, it prints the top three suspected crimes from a table built from subset.race.crime.abbr; the [1:3] index selects the first three entries.
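
The loop described above might look something like the sketch below. It is a reconstruction under assumptions, not the original code: the data frame name snf, its toy values, and the results list top.crimes.by.race are invented for illustration, and the sort() call is an assumption about how the table was ordered. The names race.code, which.race, subset.race, subset.race.crime.abbr, and crime.abbr follow the description above.

```r
# Toy stand-in for the stop-and-frisk data; the values are made up.
snf <- data.frame(
  race       = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  crime.abbr = c("CPW", "CPW", "ROB", "GLA", "CPW", "ROB", "ROB", "BUR", "ROB"),
  stringsAsFactors = FALSE
)

# One numerical code per race present in the data.
race.code <- sort(unique(snf$race))

# Hypothetical container so the results can be inspected after the loop.
top.crimes.by.race <- list()

for (code in race.code) {
  # Logical vector: TRUE for rows whose race matches the current code.
  which.race <- snf$race == code
  # Keep only those rows.
  subset.race <- snf[which.race, ]
  # Pull out the suspected-crime abbreviations for this race.
  subset.race.crime.abbr <- subset.race$crime.abbr
  # Count each abbreviation, sort from most to least frequent, keep the first three.
  top.crimes <- sort(table(subset.race.crime.abbr), decreasing = TRUE)[1:3]
  top.crimes.by.race[[as.character(code)]] <- top.crimes
  print(code)
  print(top.crimes)
}
```

One caveat with [1:3]: if a race has fewer than three distinct suspected crimes, the missing slots come back as NA. Using head(sort(table(subset.race.crime.abbr), decreasing = TRUE), 3) avoids that.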

    The printout looks something like this:

    Very exciting stuff! As you can see, most races have similar suspected crimes, with a few differences.

    We also used logical subsets for a preliminary look at whether police discriminate against obese individuals. (Note: we’re not proving anything or generalizing from this dataset; we just want to see whether heavier people were arrested more or less often than lighter people in our dataset.) We defined obese individuals as those with a BMI of 30 or higher. The code:

    As you can see, based on our preliminary observations, there's only a 7.7% correlation between obesity and arrests.
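
For readers who want to try the BMI comparison themselves, here is one way the logical subsets could be set up. This is a sketch on made-up data, not the code or the numbers from the assignment: the data frame stops and the columns weight.lb, height.in, and arrested are hypothetical names, and it assumes weight is recorded in pounds and height in inches.

```r
# Toy data; values and column names are made up for illustration.
stops <- data.frame(
  weight.lb = c(150, 240, 180, 260, 130, 210),
  height.in = c(68, 66, 72, 70, 64, 65),
  arrested  = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Add a BMI column using the imperial formula: 703 * pounds / inches^2.
stops$bmi <- 703 * stops$weight.lb / stops$height.in^2

# Logical vector: TRUE where BMI is 30 or higher.
is.obese <- stops$bmi >= 30

# Share of stops ending in arrest within each group
# (the mean of a logical vector is the proportion of TRUE values).
obese.arrest.rate     <- mean(stops$arrested[is.obese])
non.obese.arrest.rate <- mean(stops$arrested[!is.obese])

print(obese.arrest.rate)
print(non.obese.arrest.rate)

# A correlation between the two yes/no variables is a separate calculation.
obesity.arrest.cor <- cor(as.numeric(is.obese), as.numeric(stops$arrested))
print(obesity.arrest.cor)
```

Comparing the two arrest rates and computing a correlation answer slightly different questions, so it helps to say which one a reported percentage refers to.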