Mini Project 3

Author

Gisell Bennett


📘 Introduction

This mini-project explores a large Spotify playlist dataset and corresponding song characteristics. The goal is to analyze how user-created playlists reflect patterns in music preference and how these patterns align with musical attributes like energy, danceability, and valence.

🎵 Song Characteristics Dataset

Show Code
load_songs <- function() {
  library(readr)
  library(dplyr)
  library(tidyr)
  library(stringr)

  # Define directory, file path, and URL
  dir_path <- "data/mp03"
  file_path <- file.path(dir_path, "songs.csv")
  url <- "https://raw.githubusercontent.com/gabminamedez/spotify-data/refs/heads/master/data.csv"

  # Create directory if it doesn't exist
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, recursive = TRUE)
  }

  # Download the file if it doesn't exist
  if (!file.exists(file_path)) {
    download.file(url, file_path, method = "libcurl")
  }

  # Read the CSV
  SONGS <- read_csv(file_path, show_col_types = FALSE)

  # Clean and split artist list
  SONGS_clean <- SONGS |>
    mutate(
      artists = str_remove_all(artists, "\\[|\\]|'")
    ) |>
    separate_rows(artists, sep = ",\\s*") |>
    rename(artist = artists)

  return(SONGS_clean)
}

songs_df <- load_songs()

library(knitr)

songs_df |>
  select(name, artist, danceability, energy, valence, duration_ms) |>
  head(10) |>
  kable(caption = "Song Characteristics")
Song Characteristics
name artist danceability energy valence duration_ms
Singende Bataillone 1. Teil Carl Woitschach 0.708 0.1950 0.7790 158648
Fantasiestücke, Op. 111: Più tosto lento Robert Schumann 0.379 0.0135 0.0767 282133
Fantasiestücke, Op. 111: Più tosto lento Vladimir Horowitz 0.379 0.0135 0.0767 282133
Chapter 1.18 - Zamek kaniowski Seweryn Goszczyński 0.749 0.2200 0.8800 104300
Bebamos Juntos - Instrumental (Remasterizado) Francisco Canaro 0.781 0.1300 0.7200 180760
Polonaise-Fantaisie in A-Flat Major, Op. 61 Frédéric Chopin 0.210 0.2040 0.0693 687733
Polonaise-Fantaisie in A-Flat Major, Op. 61 Vladimir Horowitz 0.210 0.2040 0.0693 687733
Scherzo a capriccio: Presto Felix Mendelssohn 0.424 0.1200 0.2660 352600
Scherzo a capriccio: Presto Vladimir Horowitz 0.424 0.1200 0.2660 352600
Valse oubliée No. 1 in F-Sharp Major, S. 215/1 Franz Liszt 0.444 0.1970 0.3050 136627

📝 Spotify Million Playlist Dataset

Due to ongoing issues with the GitHub repository originally hosting the Spotify Million Playlist Dataset (flagged by students and the professor), only a single JSON file mpd.slice.0-999.json was accessible. While this limits broader generalization, the selected slice provides a representative sample to conduct exploratory analysis.

Show Code
library(httr)
library(jsonlite)
library(knitr)
library(dplyr)
library(tibble)

# Define file details
base_url <- "https://raw.githubusercontent.com/DevinOgrady/spotify_million_playlist_dataset/refs/heads/main/data1/"
file_name <- "mpd.slice.0-999.json"
file_url <- paste0(base_url, file_name)
local_file_path <- file.path("spotify_data", file_name)

# Download JSON file if not already present
if (!file.exists(local_file_path)) {
  response <- GET(file_url)
  if (status_code(response) == 200) {
    dir.create("spotify_data", showWarnings = FALSE)
    writeBin(content(response, "raw"), local_file_path)
    message("Downloaded: ", file_name)
  } else {
    stop("Error downloading file. Status code: ", status_code(response))
  }
}

# Load just a few playlists and tracks to avoid overload
if (file.exists(local_file_path)) {
  json_data <- fromJSON(local_file_path, simplifyDataFrame = FALSE)

  # Extract only the first 5 playlists and 2 tracks from each
  small_sample <- lapply(json_data$playlists[1:10], function(pl) {
    tibble(
      Playlist_Name = pl$name,
      Track_1 = pl$tracks[[1]]$track_name,
      Track_2 = pl$tracks[[2]]$track_name
    )
  })


  sample_df <- bind_rows(small_sample)
  kable(sample_df, caption = "Playlists with 2 Tracks Each", align = 'l')
  
} else {
  print("No data found to load.")
}
Playlists with 2 Tracks Each
Playlist_Name Track_1 Track_2
Throwbacks Lose Control (feat. Ciara & Fat Man Scoop) Toxic
Awesome Playlist Eye of the Tiger Libera Me From Hell (Tengen Toppa Gurren Lagann)
korean Like You GOOD (feat. ELO)
mat Danse macabre Piano concerto No. 2 in G Minor, Op. 22: Piano concerto No. 2 in G Minor, Op. 22: II. Allegro scherzando
90s Tonight, Tonight Wonderwall - Remastered
Wedding Teach Me How to Dougie Party In The U.S.A.
I Put A Spell On You I Put A Spell On You Bury Us Alive
2017 Hard To See You Happy One Thousand Times
BOP Twice 7
old country Highwayman Highwayman

📊 Questions

📦 Building a Playlist from Anchor Song

🏆 Deliverable: The Ultimate Playlist

Now that the playlist is finalized, it’s time to nominate it for the Internet’s Best Playlist award. Below are the required elements:

Title & Description

Title: The Ultimate Workout Journey

Description: A high-energy voyage that blends chart-topping hits with underground anthems—designed to uplift, surprise, and power every phase of your workout.

Design Principles

Dynamic Arc: Opens with a powerful burst, dips into moodier grooves for contrast, and surges to an energizing finale—mirroring the rhythm of an ideal workout.

Discovery & Familiarity: Balances recognizable hits to anchor the listener, with lesser-known gems that inspire exploration and repeat listens.

Seamless Flow: Aligns songs by key, tempo, and energy to create smooth, engaging transitions from track to track.

Thematic Unity: Curates a cohesive narrative centered around motivation and movement—ideal for fitness or high-energy moments.

Visualization

Below is a plot showcasing the energy trajectory across the 12 tracks. This visual argument highlights the deliberate ebb-and-flow structure that makes this playlist “ultimate”:

Show Code
ggplot(playlist_metrics, aes(x = name, y = energy, fill = energy)) +
  geom_col(width = 0.8) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  coord_flip() +
  labs(
    title = "Energy Rollercoaster of The Ultimate Workout Journey",
    x = "Track Order",
    y = "Energy"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.y = element_text(color = "black", size = 10), 
    axis.text.x = element_text(size = 10)
  )

Insight: 🎢
The chart mimics a rollercoaster ride, emphasizing the deliberate peaks and valleys in energy. Each bar’s percentage label highlights how each track contributes to the dynamic arc, making the structure both clear and engaging.

Statistical & Visual Analysis

To build and validate this playlist, I combined rigorous data analysis with clear visual storytelling:

Data-Driven Selection: Leveraged audio features (energy, tempo, danceability) extracted from Spotify’s API and playlist co-occurrence rates to rank and filter over 20,000 tracks. These metrics ensured each candidate complemented the anchor songs on a sonic level.

Discovery Thresholds: Applied a popularity cutoff (<50) to designate tracks as “under-the-radar.” This rule introduced fresh discoveries—3 of which made the final 12—and prevented the list from skewing too mainstream.**

Order Optimization: Algorithmically sorted candidates by energy progression and tempo compatibility, then manually fine-tuned the sequence to optimize listener engagement. This hybrid approach balances quantitative precision with human judgment.

Visual Validation: Charted the energy trajectory across the final playlist. The plot’s peaks and valleys visually confirm:

A strong opening to hook listeners

Mid-playlist troughs that spotlight lesser-known gems

A climactic rise for an invigorating finish

Reader-Friendly Interpretation: The annotated bar chart not only displays energy values but also highlights which tracks meet each design principle (e.g., discovery vs. familiarity), making the analysis accessible at a glance.

Together, these statistical methods and visual checks demonstrate why this playlist earns the title “Ultimate” by delivering both reliability (through data) and delight (through curated surprises).


🏁 Conclusion

“The Ultimate Workout Journey” playlist blends data an alysis with creative curation to craft an engaging listening experience. By applying heuristics like playlist co-occurrence, key and tempo matching, and mood clustering, the playlist balances popular hits with hidden gems to ensure both familiarity and discovery. The energy flow was carefully structured to maintain listener engagement, with dynamic peaks and valleys that keep the experience fresh. Visualizing the energy trajectory confirmed the thoughtful progression of the playlist. This project showcases how data-driven decisions can enhance creative efforts, resulting in a playlist that captivates and motivates listeners, while offering room for future personalization and refinement.