Sentiment Analysis for:
Harry Potter and the Philosopher’s Stone (1997)
Harry Potter and the Chamber of Secrets (1998)
Harry Potter and the Prisoner of Azkaban (1999)
Harry Potter and the Goblet of Fire (2000)
Harry Potter and the Order of the Phoenix (2003)
Harry Potter and the Half-Blood Prince (2005)
Harry Potter and the Deathly Hallows (2007)
Below is a sentiment analysis of the Harry Potter series. It finds that:
Heroes and Villains Rankings
Among the heroes, surprisingly, Professor Slughorn had the most positive text around him, with Luna Lovegood second. Among the villains, the Death Eaters carried the most negative sentiment. Unsurprisingly, the werewolf (Greyback), the ancient snake (the Basilisk), and the giant spider (Aragog) all had extremely negative sentiment.
Heroes and Villains Across the Books
Negative sentiment in the text around the villains grew substantially as the story progressed, while the sentiment in the text around the heroes stayed relatively stable. The Deathly Hallows featured the most negative sentiment of the seven books.
Treatment of the Hogwarts Houses
As it should be, Hufflepuff was the house with the most positive sentiment in the text around it.
Sentiment Across the Pages
The end of each book trended negative until the final chapter, which always seems to end on a positive note, whether because the heroes were victorious or because Harry’s friends offered him support after a more difficult ending.
rm(list = ls())
gc()
# One-time setup (uncomment as needed):
# install.packages("pacman")
# install.packages("remotes"); library(remotes)
# install.packages("devtools"); library(devtools)
# remotes::install_github("nrguimaraes/sentimentSetsR")
# devtools::install_github("wch/webshot"); webshot::install_phantomjs()
# BiocManager::install("EBImage")  # EBImage is distributed via Bioconductor
library(pacman)
# Note: knitr already provides kable(); RColorBrewer supplies brewer.pal() for
# the chart theme below
pacman::p_load(devtools, knitr, magrittr, dplyr, tidyr, ggplot2, rvest, sentimentSetsR,
caret, textTinyR, text2vec, tm, tidytext, stringr, stringi, SnowballC, stopwords,
wordcloud, prettydoc, cowplot, utf8, corpus, glue, topicmodels, stm, wordcloud2,
htmlwidgets, viridis, RColorBrewer)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, warning = FALSE,
tidy = TRUE, results = "hold", cache = FALSE, dpi = 240, out.width = "75%")
# Custom functions
# Remove quotation marks from paste() output
pasteNQ <- function(...) {
output <- paste(...)
noquote(output)
}
pasteNQ0 <- function(...) {
output <- paste0(...)
noquote(output)
}
# Percentages
propTab <- function(data, exclude = NULL, useNA = "no", dnn = NULL, deparse.level = 1,
digits = 0) {
round(table(data, exclude = exclude, useNA = useNA, dnn = dnn, deparse.level = deparse.level)/NROW(data) *
100, digits)
}
## Chart Template
Grph_theme <- function() {
    palette <- brewer.pal("Greys", n = 9)
    color.background <- palette[2]
    color.grid.major <- palette[3]
    color.axis.text <- palette[6]
    color.axis.title <- palette[7]
    color.title <- palette[9]
    theme_bw(base_size = 9) +
        theme(panel.background = element_rect(fill = color.background, color = color.background),
            plot.background = element_rect(fill = color.background, color = color.background),
            panel.border = element_rect(color = color.background),
            panel.grid.major = element_line(color = color.grid.major, size = 0.25),
            panel.grid.minor = element_blank(),
            axis.ticks = element_blank(),
            legend.position = "none",
            legend.title = element_text(size = 11, color = "black"),
            legend.background = element_rect(fill = color.background),
            legend.text = element_text(size = 9, color = "black"),
            strip.text.x = element_text(size = 9, color = "black", vjust = 1),
            plot.title = element_text(color = color.title, size = 20, vjust = 1.25),
            axis.text.x = element_text(size = 9, color = "black"),
            axis.text.y = element_text(size = 9, color = "black"),
            axis.title.x = element_text(size = 10, color = "black", vjust = 0),
            axis.title.y = element_text(size = 10, color = "black", vjust = 1.25),
            plot.margin = unit(c(0.35, 0.2, 0.3, 0.35), "cm"))
}
## Clean Corpus
basicclean <- function(rawtext) {
# Set to lowercase
rawtext <- tolower(rawtext)
# Normalize curly apostrophes
rawtext <- gsub("’", "'", rawtext)
# Expand contractions (won't/shan't first, so the generic n't rule does not
# mangle them)
fix_contractions <- function(rawtext) {
    rawtext <- gsub("won't", "will not", rawtext)
    rawtext <- gsub("shan't", "shall not", rawtext)
    rawtext <- gsub("can't", "cannot", rawtext)
    rawtext <- gsub("can not", "cannot", rawtext)
    rawtext <- gsub("n't", " not", rawtext)
    rawtext <- gsub("'ll", " will", rawtext)
    rawtext <- gsub("'re", " are", rawtext)
    rawtext <- gsub("'ve", " have", rawtext)
    rawtext <- gsub("'m", " am", rawtext)
    rawtext <- gsub("'d", " would", rawtext)
    rawtext <- gsub("'ld", " would", rawtext)
    # Drop possessives
    rawtext <- gsub("'s", "", rawtext)
    return(rawtext)
}
rawtext <- fix_contractions(rawtext)
# Strip whitespace
rawtext <- stripWhitespace(rawtext)
return(rawtext)
}
# Remove stop words
removestopwords <- function(rawtext, remove = NULL, retain = NULL) {
# Remove stop words
stopwords_custom <- stopwords::stopwords("en", source = "snowball")
stopwords_custom <- c(stopwords_custom, remove)
stopwords_retain <- retain
stopwords_custom <- stopwords_custom[!stopwords_custom %in% stopwords_retain]
rawtext <- removeWords(rawtext, stopwords_custom)
return(rawtext)
}
## Word Stemming
wordstem <- function(rawtext) {
# Stemming words
rawtext <- stemDocument(rawtext)
return(rawtext)
}
## Remove Non-Alpha
removenonalpha <- function(rawtext) {
# Remove punctuation, numbers, and other non-alphanumeric characters
rawtext <- removePunctuation(rawtext)
rawtext <- removeNumbers(rawtext)
rawtext <- gsub("[^[:alnum:]///' ]", "", rawtext)
rawtext <- gsub("[']", "", rawtext)
return(rawtext)
}
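# Example pipeline on a toy sentence of my own (not from the books):
# "He said they'll win, won't they?" %>% basicclean() %>%
#     removestopwords(remove = "said") %>% wordstem() %>% removenonalpha()
# (lowercased, contractions expanded, stop words plus "said" removed, stemmed,
# and punctuation/numbers stripped)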
# Rasterize HTML widgets (e.g., word clouds) so they render without JavaScript
library("EBImage")
embed_htmlwidget <- function(widget, rasterise = T) {
outputFormat = knitr::opts_knit$get("rmarkdown.pandoc.to")
if (rasterise || outputFormat == "latex") {
html.file = tempfile("tp", fileext = ".html")
png.file = tempfile("tp", fileext = ".png")
htmlwidgets::saveWidget(widget, html.file, selfcontained = FALSE)
webshot::webshot(html.file, file = png.file, vwidth = 700, vheight = 500,
delay = 10)
img = EBImage::readImage(png.file)
EBImage::display(img)
} else {
widget
}
}
Sentiment analysis is a subfield of natural language processing (NLP) that uses sentiment scores of individual words to describe the positive and negative language in a text. For this analysis, I parsed the text of each of the seven books into paragraph triplets averaging roughly 80-90 words. I then scored the words within those triplets to define the sentiment of these “text snapshots”.
Next, I organized the text by characters (heroes and villains) and then by Hogwarts House to see which had the most positive or negative sentiment in the roughly 80-90 words around their mentions. Finally, I looked at sentiment per page to see the “rollercoaster of emotions” that occurs during each Hogwarts school year.
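As a sketch of the parsing step (toy paragraphs of my own; the loop below does the same over each book’s HTML), paragraphs are grouped in threes, with na.omit() guarding the final, possibly short group:
toy <- c("Para one.", "Para two.", "Para three.", "Para four.")
sapply(seq(1, length(toy), by = 3), function(x) str_c(na.omit(toy[x:(x + 2)]), collapse = " "))
# "Para one. Para two. Para three." "Para four."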
setwd("C:/Users/siebe/Documents/07_Books/Harry Potter/")
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban", "Goblet of Fire",
"Order of the Phoenix", "Half-Blood Prince", "Deathly Hallows")
html <- c("Harry_Potter_and_the_Philosophers_Stone.html", "Harry_Potter_and_the_Chamber_of_Secrets.html",
"Harry_Potter_and_the_Prisoner_of_Azkaban.html", "Harry_Potter_and_the_Goblet_of_Fire.html",
"Harry_Potter_and_the_Order_of_the_Phoenix.html", "Harry_Potter_and_the_Half-Blood_Prince.html",
"Harry_Potter_and_the_Deathly_Hallows.html")
books <- tibble(Text = as.character(), Book = as.character())
para3 <- tibble(Text = as.character(), Book = as.character())
for (i in 1:7) {
rawtext <- read_html(html[i]) %>% html_nodes(xpath = "/html/body/p") %>% html_text(trim = TRUE)
wordcount <- sapply(strsplit(rawtext, " "), length)
paragraph <- rawtext[wordcount >= 3]
# Book Level Documents
books <- rbind(books, tibble(Text = str_c(paragraph, collapse = " "), Book = titles[i]))
# Paragraph Level Documents (na.omit() guards the final group when the
# paragraph count is not a multiple of 3; str_c() would otherwise return NA)
triplet <- do.call(rbind, lapply(seq(1, length(paragraph), by = 3), function(x) tibble(Text = str_c(na.omit(paragraph[x:(x +
2)]), collapse = " "), Book = titles[i])))
para3 <- rbind(para3, triplet)
}
# Page Level Documents
pages <- books %>% unnest_tokens(Text, Text, strip_punct = T, to_lower = F) %>% group_by(Book,
Page = dplyr::row_number()%/%250) %>% dplyr::summarize(Text = stringr::str_c(Text,
collapse = " ")) %>% mutate(Page = dplyr::row_number()) %>% ungroup()
# Place books in chronological order
books$Book <- factor(books$Book, levels = titles)
para3$Book <- factor(para3$Book, levels = titles)
pages$Book <- factor(pages$Book, levels = titles)
In this sentiment analysis, I was interested in seeing which characters have the most positive and, conversely, the most negative sentiment in the text around them. Using the paragraph triplets (averaging around 80-90 words), I counted the positive and negative words around selected characters.
I created three character dictionaries to compare against each other. The first contains our main trio, who are mentioned in most of the text. The second contains the heroes (and organizations) outside of the main trio. The third contains the villains of the story arc. Note, without spoiling anything, at least one of the villains arguably turns out to be a good character, but in terms of the text he spends most of the story on the dark side.
# Create Dictionaries
trio <- c("harry", "ron", "hermione")
pasteNQ("Main Characters:")
paste(trio)
cat("\n")
heroes <- c("lily", "james", "hagrid", "dumbledore", "sirius", "lupin", "moody",
"slughorn", "dobby", "cedric", "luna", "tonks", "mcgonagall", "ginny", "order of the phoenix",
"neville")
pasteNQ("Number of Heroes:", length(heroes))
pasteNQ("Including:")
paste(heroes)
cat("\n")
villians <- c("voldemort", "nagini", "snape", "draco", "lucius", "umbridge", "pettigrew",
"dementor", "dementors", "greyback", "bellatrix", "quirrell", "riddle", "death eaters",
"aragog", "basilisk", "dudley", "vernon", "petunia")
pasteNQ("Number of Villians:", length(villians))
pasteNQ("Including:")
paste(villians)
[1] Main Characters:
[1] "harry" "ron" "hermione"
[1] Number of Heroes: 16
[1] Including:
[1] "lily" "james" "hagrid"
[4] "dumbledore" "sirius" "lupin"
[7] "moody" "slughorn" "dobby"
[10] "cedric" "luna" "tonks"
[13] "mcgonagall" "ginny" "order of the phoenix"
[16] "neville"
[1] Number of Villains: 19
[1] Including:
[1] "voldemort" "nagini" "snape" "draco" "lucius"
[6] "umbridge" "pettigrew" "dementor" "dementors" "greyback"
[11] "bellatrix" "quirrell" "riddle" "death eaters" "aragog"
[16] "basilisk" "dudley" "vernon" "petunia"
# Sentiment Scores
average <- function(x) {
pos <- sum(x[x > 0], na.rm = T)
neg <- sum(x[x < 0], na.rm = T) %>% abs()
neu <- length(x[x == 0])
bal <- ((pos - neg)/(pos + neg)) * 100
y <- ifelse(is.nan(bal), 0, bal %>% as.numeric())
return(y)
}
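# Sanity check (toy scores of my own): pos = 3, neg = 1, so the balance is
# (3 - 1)/(3 + 1) * 100 = 50, e.g. average(c(2, 1, -1, 0)) returns 50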
CleanText <- basicclean(para3$Text)
para3$Score <- sapply(CleanText, function(x) getSentiment(x, dictionary = "vader",
score.type = average))
para3$Cross <- NA
para3$Cross <- ifelse(para3$Score < 0, "Negative", para3$Cross)
para3$Cross <- ifelse(para3$Score > 0, "Positive", para3$Cross)
para3$Cross <- ifelse(para3$Score == 0, "Neutral", para3$Cross)
pasteNQ("Sentiment in Paragraphs (%)")
propTab(para3$Cross)
[1] Sentiment in Paragraphs (%)
Negative Neutral Positive <NA>
43 10 47 0
Surprisingly, Professor Slughorn has the most positive text around him. This is perhaps because Harry, for the first time, enjoyed and excelled at Potions with him as teacher. Additionally, Harry spent a non-trivial amount of time under the Felix Felicis potion while with Professor Slughorn.
Fan favorite Luna Lovegood has the second most positive text around her. This was not surprising, as she has a quirky, upbeat personality.
Professor Moody had the most negative sentiment. This is likely due both to his usual sour personality and to the fact that, for most of the time we read about him, it was actually the villain Barty Crouch, Jr. in disguise.
Harry’s mother, Lily, unsurprisingly has the second most negative sentiment, as she is mostly mentioned when Harry thinks back on her death.
# Subset text mentioning a hero
heroes_sent <- data.frame(Hero = NA, Sentiment = NA)
# Find mean of sentiment scores for each hero and create ID of all heroes for
# total row
heroes_ID <- c()
for (string in heroes) {
findstr <- paste0("\\b", string, "\\b")
heroes_sent <- rbind(heroes_sent, c(string, para3 %>% as.data.frame() %>% .[grepl(findstr,
CleanText), "Score"] %>% average() %>% as.numeric()))
heroes_ID <- c(heroes_ID, grep(findstr, CleanText))
}
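# Note (toy strings of my own): the \\b anchors match whole words only, so
# grepl("\\bron\\b", c("ron grinned", "the patronus charm")) returns TRUE FALSE
# ("ron" inside "patronus" does not match)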
# Sort data
heroes_sent <- heroes_sent %>% .[order(heroes_sent$Sentiment %>% as.numeric()), ] %>%
tidyr::drop_na()
max_sent <- max(heroes_sent$Sentiment %>% as.numeric() %>% abs() + 1)
# Add hero total
heroes_ID <- unique(heroes_ID)
heroes_sent <- rbind(heroes_sent, c("average of heroes", para3 %>% as.data.frame() %>%
.[heroes_ID, "Score"] %>% average() %>% as.numeric()))
# Graph
heroes_sent %>% mutate(Hero = factor(Hero, levels = Hero)) %>% ggplot(aes(x = Hero,
y = Sentiment %>% as.numeric() %>% round(1), fill = Hero)) + geom_col() + coord_flip() +
Grph_theme() + ylab("Sentiment (-100% to 100%)") + xlab("") + ggtitle("Sentiment of Heroes") +
ylim(max_sent * -1, max_sent) + scale_fill_manual(values = c(viridis(nrow(heroes_sent) -
1), "black")) + guides(fill = FALSE)
It makes sense that Tom Riddle, Professor Quirrell, and perhaps even Professor Snape have more positive than negative text around them.
Professor Umbridge, surprisingly, has the most positive sentiment. As a character, she projects an “everything is fine and I’m here to help” persona even as she seeks to torture the main trio.
I intentionally defined “dementor” and “dementors” separately. In the plural, they carry strongly negative sentiment. The most interesting takeaway, however, is that in the singular a dementor has more positive than negative text around it. This actually makes sense, as Harry must summon positive thoughts and memories to combat these villains, the personification of depression.
Unsurprisingly, the werewolf (Greyback), the ancient snake (the Basilisk), and the giant spider (Aragog) all have extremely negative sentiment.
# Subset text mentioning a villain
villians_sent <- data.frame(Villian = NA, Sentiment = NA)
# Find mean of sentiment scores for each villain and create ID of all villains
# for total row
villians_ID <- c()
for (string in villians) {
findstr <- paste0("\\b", string, "\\b")
villians_sent <- rbind(villians_sent, c(string, para3 %>% as.data.frame() %>%
.[grepl(findstr, CleanText), "Score"] %>% average() %>% as.numeric()))
villians_ID <- c(villians_ID, grep(findstr, CleanText))
}
# Sort data
villians_sent <- villians_sent %>% .[order(villians_sent$Sentiment %>% as.numeric()),
] %>% tidyr::drop_na()
max_sent <- max(villians_sent$Sentiment %>% as.numeric() %>% abs() + 1)
# Add villain total
villians_ID <- unique(villians_ID)
villians_sent <- rbind(villians_sent, c("average of villains", para3 %>% as.data.frame() %>%
.[villians_ID, "Score"] %>% average() %>% as.numeric()))
# Graph
villians_sent %>% mutate(Villian = factor(Villian, levels = Villian)) %>% ggplot(aes(x = Villian,
y = Sentiment %>% as.numeric() %>% round(1), fill = Villian)) + geom_col() +
coord_flip() + Grph_theme() + ylab("Sentiment (-100% to 100%)") + xlab("") +
ggtitle("Sentiment of Villains") + ylim(max_sent * -1, max_sent) + scale_fill_manual(values = c(viridis(nrow(villians_sent) -
1), "black")) + guides(fill = FALSE)
Harry has less positive sentiment in the text around him than Ron and Hermione. This probably has a lot to do with his angsty stint in Order of the Phoenix (Book 5), where I suspect the caps lock key on J.K. Rowling’s keyboard may have broken under heavy use.
# Subset text mentioning a main character
trio_sent <- data.frame(Trio = NA, Sentiment = NA)
# Find mean of sentiment scores for each main character and create ID of all trio
# for total row
trio_ID <- c()
for (string in trio) {
findstr <- paste0("\\b", string, "\\b")
trio_sent <- rbind(trio_sent, c(string, para3 %>% as.data.frame() %>% .[grepl(findstr,
CleanText), "Score"] %>% average() %>% as.numeric()))
trio_ID <- c(trio_ID, grep(findstr, CleanText))
}
# Sort data
trio_sent <- trio_sent %>% .[order(trio_sent$Sentiment %>% as.numeric()), ] %>% tidyr::drop_na()
# Add main character total
trio_ID <- unique(trio_ID)
trio_sent <- rbind(trio_sent, c("average", para3 %>% as.data.frame() %>% .[trio_ID,
"Score"] %>% average() %>% as.numeric()))
# Graph
trio_sent %>% mutate(Trio = factor(Trio, levels = Trio)) %>% ggplot(aes(x = Trio,
y = Sentiment %>% as.numeric() %>% round(1), fill = Trio, label = Sentiment)) +
geom_col() + geom_text(aes(label = Sentiment %>% as.numeric() %>% round(1)),
vjust = -0.5) + Grph_theme() + ylab("Positive Sentiment (%)") + xlab("") + ggtitle("Sentiment of the Main Trio") +
ylim(0, 25) + scale_fill_manual(values = c("#FFC500", "#AAAAAA", "#7F0909", "black")) +
guides(fill = FALSE)
Here, I wanted to note the prevalence of the main trio, the heroes, and the villains. The main trio are mentioned in roughly 85% of the paragraph triplets. The heroes outside the main trio appear in fewer than 50% of the triplets, while the villains are much rarer (fewer than 30%).
This demonstrates that J.K. Rowling prefers to focus on the likeable characters and keeps the overall story arc, featuring the villains, to a minimum.
docs_prop <- tibble(Percentage = c(round(length(trio_ID)/nrow(para3) * 100, 1), round(length(heroes_ID)/nrow(para3) *
100, 1), round(length(villians_ID)/nrow(para3) * 100, 1)), Characters = c("Main Trio",
"Heroes", "Villains"))
colors <- c(Heroes = "#7F0909", Villains = "#0D6217", `Main Trio` = "#AAAAAA")
ggplot(docs_prop, mapping = aes(x = Characters, y = Percentage, fill = Characters,
label = Percentage)) + geom_col() + geom_text(aes(label = Percentage), vjust = -0.5) +
ggtitle("Characters as Percentage of Documents") + Grph_theme() + scale_fill_manual(values = colors) +
theme(legend.position = "right") + ylim(0, 100) + labs(x = "Characters", y = "Percentage (%)",
color = "Legend")
The chart below shows the average sentiment for the main trio, heroes, and villains as the series progresses. The main takeaway: negative sentiment in the text around the villains grows substantially across the books, while the sentiment around the heroes stays relatively stable.
# Main Characters Sentiment
trio_grp <- para3[trio_ID, ] %>% mutate(Negative = recode(Cross, Positive = 0, Neutral = 0,
Negative = 1), Neutral = recode(Cross, Positive = 0, Neutral = 1, Negative = 0)) %>%
group_by(Book) %>% summarize(`Trio Sentiment` = mean(Negative, na.rm = T) * 100)
# Heroes Sentiment
heroes_grp <- para3[heroes_ID, ] %>% mutate(Negative = recode(Cross, Positive = 0,
Neutral = 0, Negative = 1), Neutral = recode(Cross, Positive = 0, Neutral = 1,
Negative = 0)) %>% group_by(Book) %>% summarize(`Heroes Sentiment` = mean(Negative,
na.rm = T) * 100)
# Villains Sentiment
villians_grp <- para3[villians_ID, ] %>% mutate(Negative = recode(Cross, Positive = 0,
Neutral = 0, Negative = 1), Neutral = recode(Cross, Positive = 0, Neutral = 1,
Negative = 0)) %>% group_by(Book) %>% summarize(`Villains Sentiment` = mean(Negative,
na.rm = T) * 100)
# Plot
df <- inner_join(heroes_grp, villians_grp, by = "Book") %>% as_tibble()
df <- inner_join(trio_grp, df, by = "Book")
colors <- c(Heroes = "#7F0909", Villains = "#0D6217", `Main Trio` = "#AAAAAA")
ggplot(df, mapping = aes(x = Book, group = NA)) + geom_line(aes(y = `Trio Sentiment`,
color = "Main Trio"), size = 3) + geom_line(aes(y = `Heroes Sentiment`, color = "Heroes"),
size = 1.5) + geom_line(aes(y = `Villains Sentiment`, color = "Villains"), size = 1.5) +
ggtitle("Sentiment of Heroes/Villains Across Books") + Grph_theme() + scale_color_manual(values = colors) +
theme(axis.text.x = element_text(size = 8, angle = 45)) + theme(legend.position = "right") +
ylim(0, 60) + labs(x = "Books", y = "Negative Sentiment (%)", color = "Legend")
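The recode() calls above turn the Positive/Neutral/Negative labels into a 0/1 negative indicator, so the group mean is the share of negative triplets. For instance:
recode(c("Positive", "Negative", "Neutral"), Positive = 0, Neutral = 0, Negative = 1)
# 0 1 0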
Ok, I needed to know the sentiment surrounding the four Hogwarts Houses: Gryffindor, Slytherin, Ravenclaw, and Hufflepuff.
# Create Dictionaries
houses <- c("gryffindor", "slytherin", "ravenclaw", "hufflepuff")
pasteNQ("Houses:")
paste(houses)
[1] Houses:
[1] "gryffindor" "slytherin" "ravenclaw" "hufflepuff"
…and the best house, Hufflepuff, easily contained the most positive sentiment!
# Subset text mentioning a house
houses_sent <- data.frame(Houses = NA, Sentiment = NA)
# Find mean of sentiment scores for each house and create ID of all houses for
# total row
houses_ID <- c()
for (string in houses) {
findstr <- paste0("\\b", string, "\\b")
houses_sent <- rbind(houses_sent, c(string, para3 %>% as.data.frame() %>% .[grepl(findstr,
CleanText), "Score"] %>% average() %>% as.numeric()))
houses_ID <- c(houses_ID, grep(findstr, CleanText))
assign(paste0(string, "_ID"), grep(findstr, CleanText))
}
# Sort data
houses_sent <- houses_sent %>% .[order(houses_sent$Sentiment %>% as.numeric()), ] %>%
tidyr::drop_na()
max_sent <- max(houses_sent$Sentiment %>% as.numeric() %>% abs() + 1)
# Add house total
houses_ID <- unique(houses_ID)
houses_sent <- rbind(houses_sent, c("average", para3 %>% as.data.frame() %>% .[houses_ID,
"Score"] %>% average() %>% as.numeric()))
# Graph
houses_sent %>% mutate(Houses = factor(Houses, levels = Houses)) %>% ggplot(aes(x = Houses,
y = Sentiment %>% as.numeric() %>% round(1), fill = Houses, label = Sentiment)) +
geom_col() + geom_text(aes(label = Sentiment %>% as.numeric() %>% round(1)),
vjust = -0.5) + Grph_theme() + ylab("Positive Sentiment") + xlab("") + ggtitle("Sentiment of Houses") +
ylim(0, max_sent) + scale_fill_manual(values = c("#0D6217", "#7F0909", "#000A90",
"#FFC500", "black")) + guides(fill = FALSE)
It seems the houses were rarely mentioned. Gryffindor was easily mentioned the most, yet appears in only roughly 4% of the text. Slytherin was mentioned in a little over 2% of the text. Poor Hufflepuff and Ravenclaw were each mentioned in less than 1% of the text.
docs_prop <- tibble(Percentage = c(round(length(gryffindor_ID)/nrow(para3) * 100,
1), round(length(slytherin_ID)/nrow(para3) * 100, 1), round(length(ravenclaw_ID)/nrow(para3) *
100, 1), round(length(hufflepuff_ID)/nrow(para3) * 100, 1)), Houses = c("Gryffindor",
"Slytherin", "Ravenclaw", "Hufflepuff"))
colors <- c(Gryffindor = "#7F0909", Slytherin = "#0D6217", Ravenclaw = "#000A90",
Hufflepuff = "#FFC500")
ggplot(docs_prop, mapping = aes(x = Houses, y = Percentage, fill = Houses, label = Percentage)) +
geom_col() + geom_text(aes(label = Percentage), vjust = -0.5) + ggtitle("Houses as Percentage of Documents") +
Grph_theme() + scale_fill_manual(values = colors) + theme(legend.position = "right") +
ylim(0, 5) + labs(x = "Houses", y = "Percentage (%)", color = "Legend")
For fun, here are the words most specific to each of the four Houses (using TF-IDF scores).
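TF-IDF weighs how often a word appears in a document against how many documents contain it, so words common to every document score near zero. A minimal sketch with tidytext (toy counts of my own, not from the books):
toy_counts <- tibble(doc = c("A", "A", "B"), word = c("quidditch", "wand", "wand"), n = c(1, 1, 1))
toy_counts %>% bind_tf_idf(word, doc, n)
# "wand" appears in both documents, so its idf (and tf_idf) is 0, while
# "quidditch" is unique to document A and receives the highest tf_idf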
set.seed(20)
para3$ID <- row.names(para3)
para3$CleanText <- para3$Text %>% basicclean() %>% removestopwords(remove = "said") %>%
removenonalpha()
# TFIDF Scores
TFIDF <- para3[houses_ID, ] %>% unnest_tokens(word, CleanText) %>% anti_join(get_stopwords()) %>%
count(ID, word, sort = T) %>% bind_tf_idf(word, ID, n) %>% arrange(desc(tf_idf))
Gryffindor_d <- TFIDF[TFIDF$ID %in% gryffindor_ID, ]
Slytherin_d <- TFIDF[TFIDF$ID %in% slytherin_ID, ]
Ravenclaw_d <- TFIDF[TFIDF$ID %in% ravenclaw_ID, ]
Hufflepuff_d <- TFIDF[TFIDF$ID %in% hufflepuff_ID, ]
# Gryffindor
Gryffindor <- Gryffindor_d %>% anti_join(Slytherin_d) %>% anti_join(Ravenclaw_d) %>%
anti_join(Hufflepuff_d)
cloud <- data.frame(words = rbind(Gryffindor[1:20, "word"], "GRYFFINDOR"), weight = rbind(Gryffindor[1:20,
"tf_idf"], 0.8))
wc_1 <- wordcloud2(cloud, size = 0.4, color = "random-light", backgroundColor = "black")
# Slytherin
Slytherin <- Slytherin_d %>% anti_join(Gryffindor_d) %>% anti_join(Ravenclaw_d) %>%
anti_join(Hufflepuff_d)
cloud <- data.frame(words = rbind(Slytherin[1:20, "word"], "SLYTHERIN"), weight = rbind(Slytherin[1:20,
"tf_idf"], 0.8))
wc_2 <- wordcloud2(cloud, size = 0.5, color = "random-light", backgroundColor = "black")
# Ravenclaw
Ravenclaw <- Ravenclaw_d %>% anti_join(Slytherin_d) %>% anti_join(Gryffindor_d) %>%
anti_join(Hufflepuff_d)
cloud <- data.frame(words = rbind(Ravenclaw[1:20, "word"], "RAVENCLAW"), weight = rbind(Ravenclaw[1:20,
"tf_idf"], 0.8))
wc_3 <- wordcloud2(cloud, size = 0.4, color = "random-light", backgroundColor = "black")
# Hufflepuff
Hufflepuff <- Hufflepuff_d %>% anti_join(Slytherin_d) %>% anti_join(Ravenclaw_d) %>%
anti_join(Gryffindor_d)
cloud <- data.frame(words = rbind(Hufflepuff[1:20, "word"], "HUFFLEPUFF"), weight = rbind(Hufflepuff[1:20,
"tf_idf"], 0.8))
wc_4 <- wordcloud2(cloud, size = 0.45, color = "random-light", backgroundColor = "black")
embed_htmlwidget(wc_1)
embed_htmlwidget(wc_2)
embed_htmlwidget(wc_3)
embed_htmlwidget(wc_4)
At the beginning of this document, I defined the page level as containing 250 words. Here I use the page level to observe how sentiment changes over the pages of each book.
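The bucketing itself is integer division on a running word index. A miniature version with 3-word “pages” instead of 250 (toy data of my own):
tibble(word = paste0("w", 1:7)) %>% group_by(Page = row_number() %/% 3) %>%
    summarize(Text = str_c(word, collapse = " "))
# Page 0: "w1 w2"; Page 1: "w3 w4 w5"; Page 2: "w6 w7"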
# Sentiment Scores (reusing the average() helper defined earlier)
CleanText <- basicclean(pages$Text)
pages$Score <- sapply(CleanText, function(x) getSentiment(x, dictionary = "vader",
score.type = average))
pages$Cross <- NA
pages$Cross <- ifelse(pages$Score < 0, "Negative", pages$Cross)
pages$Cross <- ifelse(pages$Score > 0, "Positive", pages$Cross)
pages$Cross <- ifelse(pages$Score == 0, "Neutral", pages$Cross)
pasteNQ("Sentiment in Pages (%)")
propTab(pages$Cross)
[1] Sentiment in Pages (%)
Negative Neutral Positive
46 0 54
Finally, we can view sentiment per page to see the rollercoaster of emotions throughout the books. The main takeaway: each book trends negative toward its end until the final chapter, which swings back to a positive note.
# Grouping by pages for the x-axis
pages %>% group_by(Book) %>% mutate(Paragraph = row_number()) %>% ggplot(aes(Page,
Score, color = Book)) + Grph_theme() + theme(panel.background = element_rect(fill = "#FFFFFF",
color = "#FFFFFF")) + geom_line(alpha = 0.3, show.legend = FALSE) + geom_smooth(method = lm,
formula = y ~ splines::bs(x, 15), se = F, show.legend = FALSE) + geom_line(y = 0) +
scale_color_manual(values = c("#946B2D", "#0D6217", "#000A90", "#AAAAAA", "#000000",
"#7F0909", "#FFC500")) + ylim(-100, 100) + labs(x = "Page Number", y = "Sentiment (-100% to 100%)") +
facet_wrap(~Book, ncol = 2, scales = "free_x")
This was an analysis of the series’ tone. Previous analyses looked more broadly at the vocabulary, word distributions, and major themes/settings. A subsequent analysis will compare the books to the film adaptations (upcoming).