Clustering litter to collection objects

Welcome

The purpose of this analysis is to look at how litter clustered during the cleanup done by the Rubbish, co. team at the Startup Grind 2020 event. We can use clustering to evaluate the placement of collection objects (e.g., trash cans) and possibly improve cleanup policies. We can also use it to gain insight into the habits of people who throw trash on the ground!

This analysis uses R visualization and Python scripting together, by way of the reticulate package. The data was already cleaned using R; the script that does so is clean_script.R. A Python function was also created for the clustering algorithm, which is located in euclidean_script.py. Both files can be found in my GitHub repository for this analysis.

This write-up will not discuss the impacts of litter type and litter production during an event. For a very well done analysis of those attributes, please visit this blog post by Rubbish’s founder, Emin Israfil.

A little about who Rubbish, co. is

Rubbish, co. is a fresh startup based in California. They recently created the Rubbish iOS application, where you can clean up litter and track it! The app allows users to take pictures of each collected item while simultaneously tracking where the item was collected. The user then self-annotates which classification the item belongs to (such as plastic, paper, or tobacco). They have even created the Rubbish Beam, a trash picker-upper dedicated to automatically taking pictures and tracking the litter, making this process easier and faster for the user!

About Event and Cleanup

Rubbish partnered with Startup Grind and the Fox Theatre Redwood City Team (Team Fox) with a goal to make Startup Grind Global 2020 the first conference in the nation to audit their litter and quantify their litter footprint.

They started the litter audit with a baseline cleanup on Sunday, where they found a measly 392 pieces of litter. Monday was the Opening Night Event, and the Rubbish team went out to clean up after the event and found 613 pieces of litter. On Day 1 of the event, Tuesday, Rubbish did another cleanup run after the event and found 868 pieces of litter. On the second day of the event, Wednesday, they again went out and collected 564 pieces of litter after the day’s activities.

How much litter was cleaned?

Let’s start by getting a baseline for the volume of litter collected on each of our four days.

From this we can see a clear increase in litter pickup for Tuesday and Wednesday. The higher volume of collected litter is expected, as the actual event took place over these two days. However, we see a drastic spike in paper collected on Tuesday. This spike is most likely made up of event flyers that were handed out and later dropped on the ground. In total, 2437 objects were collected, and 42 collection objects were tagged across all four days.
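As a quick sanity check on the totals quoted above, here is a minimal Python sketch that tallies the four daily counts from this write-up. The dictionary is typed in by hand from the numbers reported here; it is not read from the cleaned dataset.

```python
# Litter counts per cleanup day, copied by hand from the write-up above.
daily_counts = {"Sunday": 392, "Monday": 613, "Tuesday": 868, "Wednesday": 564}

total = sum(daily_counts.values())
busiest_day = max(daily_counts, key=daily_counts.get)

# Print each day's count and its share of everything collected.
for day, count in daily_counts.items():
    print(f"{day:<10} {count:>4}  ({count / total:.1%})")
print(f"Total: {total}; busiest day: {busiest_day}")
```

The four days add up to the 2437 objects mentioned above, with Tuesday contributing the most.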

What about Collection Objects?

Collection Objects are defined to be Trash Cans, Recycling Cans, and Tobacco Ash Cans.

For our clustering algorithm, we will define which items are allowed in each collection object.

  • That is, Tobacco Ash Cans accept only tobacco products, Recycling Cans accept only recyclable products, and Trash Cans accept everything.

From here, we cycle through each tracked item and find its closest allowable collection point, using straight-line distance with an adjustment for longitude and latitude. The item is then considered part of that collection point’s cluster.
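To make the assignment step concrete, here is a minimal Python sketch of the idea. It is not the contents of euclidean_script.py: the type names in `ALLOWED`, the field names, the coordinates, and the choice to scale longitude by the cosine of the latitude are all assumptions made for illustration.

```python
import math

# Hypothetical mapping: collection object type -> litter types it accepts.
# None means "accepts everything".
ALLOWED = {
    "trashCan": None,
    "recyclingCan": {"paper", "plastic", "glass", "metal"},
    "tobaccoAshCan": {"tobacco"},
}

def approx_distance_m(lat1, lon1, lat2, lon2):
    """Straight-line distance in meters, shrinking longitude by cos(latitude)."""
    meters_per_degree = 111_320.0  # approximate length of one degree of latitude
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = (lat2 - lat1) * meters_per_degree
    dx = (lon2 - lon1) * meters_per_degree * math.cos(mean_lat)
    return math.hypot(dx, dy)

def assign_to_closest(items, cans):
    """For each item, return (index of closest allowable can, distance in meters).

    An item with no allowable can gets None.
    """
    assignments = []
    for item in items:
        best = None
        for idx, can in enumerate(cans):
            allowed = ALLOWED[can["type"]]
            if allowed is not None and item["type"] not in allowed:
                continue  # this can does not accept this kind of litter
            d = approx_distance_m(item["lat"], item["lon"], can["lat"], can["lon"])
            if best is None or d < best[1]:
                best = (idx, d)
        assignments.append(best)
    return assignments

# Made-up coordinates, for illustration only.
cans = [
    {"type": "trashCan", "lat": 37.4860, "lon": -122.2290},
    {"type": "tobaccoAshCan", "lat": 37.4861, "lon": -122.2291},
]
items = [
    {"type": "tobacco", "lat": 37.48611, "lon": -122.22911},
    {"type": "paper", "lat": 37.48611, "lon": -122.22911},
]
assignments = assign_to_closest(items, cans)
```

Both example items sit right next to the ash can, but only the tobacco item is assigned to it. The paper item falls back to the trash can farther away, because paper is not allowed in an ash can.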

This script was created in Python and saved as euclidean_script.py, which can be found in my GitHub repository for this analysis. The Python script is used directly in R as a callable function through the reticulate package! We could have also created the function directly in the R Markdown file; however, it is much easier to just import it as a function. So let’s cluster our items!

What do our clusters look like?

After applying our clustering algorithm with a Python script, we find the following clusters.

In the above scatter plot, the black dots correspond to the collection objects. The colored dots are the objects the Rubbish team collected throughout all 4 days. The larger cluster of black dots is made up of collection objects located inside the bathrooms in the building. This throws things off a bit, and in the future I would suggest that Rubbish distinguish which collection objects are located inside bathrooms, so that we can aggregate the results. However, since there is no indication of which are inside or outside of the bathrooms, they will all be treated as given.

For the clusters, we actually see pretty good groupings, although it is clear the placement of the collection objects could be improved to better cover the areas defined. Unfortunately, without any idea of how full these collection objects were, nothing fruitful would be gained from trying to find better locations for them: it is unclear whether a cluster with a high volume of litter is due to high foot traffic or to an overfilled collection object.

Did the Rubbish Team have an effect?

To see if the Rubbish team had an effect on the clustering, we should look at where they were focused during each hour. To do this, a streamgraph gives us the best representation. The overall length of each block gives us the total number of litter pieces picked up and tracked by the Rubbish team!

This graph represents the amount of litter collected per hour, for each collection object.

Each separation of color corresponds to a different collection object. Colors are repeated, as there are too many collection objects to give each a distinct color.
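The numbers behind a graph like this are just counts of litter per hour, per collection object. Here is a minimal Python sketch of that aggregation, assuming each record is a (timestamp, collection object ID) pair. The records are made up, and the real column names and time format in the dataset may differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (time of pickup, ID of the assigned collection object).
records = [
    ("2020-02-11 17:05", 3),
    ("2020-02-11 17:40", 3),
    ("2020-02-11 17:55", 7),
    ("2020-02-11 18:10", 3),
]

def hourly_counts(records):
    """Count litter pieces per (hour, collection object ID)."""
    counts = Counter()
    for stamp, can_id in records:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(minute=0)
        counts[(hour, can_id)] += 1
    return counts

counts = hourly_counts(records)
for (hour, can_id), n in sorted(counts.items()):
    print(f"{hour:%a %H:00}  can {can_id}: {n}")
```

Each (hour, collection object) pair becomes one band segment in the streamgraph, with the count as its thickness.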

From the above we can decipher when the Rubbish team was on location in full-force cleanup mode: we see drastic spikes in the volume of collected litter. This could mean that litter collection was not consistent across the whole event, that is, the team cleaned in certain time segments instead of during the entire event. It could also mean there was simply less litter during the lower-volume times. After talking to the Rubbish team, the reality is that they did their cleaning as the event was coming to a close. My suggestion would be to clean during the entire event, in order to lessen the time-scale bias. Because of this bias in collection times, any analysis of the litter produced over time could easily misrepresent reality.

Another point is that we can see larger volumes of litter around certain collection objects during these time blocks. This could be due to a larger build-up of litter around these areas, or it could be due to the Rubbish team lingering in and favoring certain collection object locations. It is difficult to tell from this graph, so to find out how the team collected their litter we can create a time map of all the litter they collected!

Viewing the collection over time

To get an idea of how well the Rubbish team traversed the event space, we can look at the collection of litter one point at a time.

From the above gif we can see how well the Rubbish team moved around the area to collect the litter. If you look closely, you can even tell how many people were picking up litter at the same time.

It seems there is some lingering around some areas; however, this could have easily been due to a higher volume of litter in these areas. The Rubbish team also did not pick up any litter in the lower left area, yet collection objects were tagged there.

Well, how close was the litter to Collection Objects?

One important feature is how close litter was, on average, to its closest collection point. This feature could offer clues about how lazy humans are! If the average distance to the collection object is low, that means people are pretty lazy. However, if the average distance is large, this could mean that no collection object was close enough for the person to reasonably throw away their litter. Of course, we should all just hold onto our trash a little longer and actually throw it away.

One important note is that this has an inherent bias. We, again, do not know how full these collection objects were. People could have looked for an appropriate collection object but not found one with enough room for their trash. Or they might even have thrown their trash away, only for it to fall out of a full collection object.

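# packages used below (dcast() is assumed to come from reshape2)
library(dplyr)
library(reshape2)
library(RColorBrewer)
library(heatmaply)

# prepping data for heatmap plot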
plot_data <- clusters %>%
  mutate(
    closest_cent = paste(ifelse(cent_type == "trashCan", "Trash:",
                                ifelse(cent_type == "recyclingCan", "Recy:", "Ash:")),
                         ifelse(closest_cent<10, paste("0",closest_cent, sep=""), closest_cent),
                         sep= " ")
  ) %>% 
  subset(select = c(day, closest_cent, mean_dist)) %>% 
  dcast(closest_cent ~ day, value.var = "mean_dist") %>% 
  mutate( # changing na values to 0
    Sunday = ifelse(is.na(Sunday),0,Sunday)
    ,Monday = ifelse(is.na(Monday),0,Monday)
    ,Tuesday = ifelse(is.na(Tuesday),0,Tuesday)
    ,Wednesday = ifelse(is.na(Wednesday),0,Wednesday)
  )

# changing rownames to centroid id
rownames(plot_data) <- plot_data[,"closest_cent"]

# making dataframe into matrix
plot_data <- plot_data %>% 
  subset(select = -c(closest_cent)) %>% 
  as.matrix()

# colors for graph
colors <- brewer.pal(9, "RdPu")[c(1, 6:9)]
colors[1] <- "#ffffff"

# row clustering order
row_order <- plot_data %>% 
  dist(method = "euclidean") %>% 
  hclust(method = "complete") %>% 
  as.dendrogram() %>% 
  rev()


# plotting heatmap
plot_data %>% 
  heatmaply(
          plot_method = "plotly"
          ,colors = colorRampPalette(colors)
          ,dendrogram = "both"
          ,show_dendrogram = c(FALSE, FALSE)
          ,label_names = c("Day", "Collection ID", "Mean Distance")
          ,grid_color = "white"
          ,main = "Mean Distance of Litter from the (Trash / Recycling / Ash) Can"
          #ylab = "Collection Objects (ID)",
          ,xlab = "A distance of 0 means there are no objects around the Collection Object."
          ,key.title = "meters"
          ,showticklabels = c(TRUE, TRUE)
          ,column_text_angle = 0
          ,colorbar_len = .8
          ,grid_gap = 1
          ,Rowv = row_order
          ,Colv = clustered_data$day %>% unique()
          # ,cellnote = plot_data
          ) %>% 
  layout(width=800)

From the above we can see that most collection objects were fairly consistent in the average distance of their surrounding litter. Some collection objects, namely the top IDs, have a large distance, while the lower part of the graph has a very low average distance. These disparities could be due to human laziness, the “fullness” of the collection object, or the amount of foot traffic in the area.

Along with this, we found that 6 of our 42 collection objects have a mean distance of more than 20 meters when the distance is summed across all four days. This leads us to believe that at least these 6 collection objects could be moved to better locations to accommodate the event’s participants.
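The flagging rule described above (sum each collection object’s mean distance over the four days, then compare against 20 meters) can be sketched in a few lines of Python. The three collection objects and their distances below are made up for illustration; only the 20 meter threshold and the 6-of-42 figure come from this analysis.

```python
# Hypothetical mean distances (meters) per collection object, per day.
# A value of 0 means no litter was assigned to that object on that day.
mean_dist = {
    "Trash: 01": {"Sunday": 4.0, "Monday": 6.5, "Tuesday": 7.0, "Wednesday": 5.5},
    "Recy: 02": {"Sunday": 0.0, "Monday": 2.0, "Tuesday": 3.5, "Wednesday": 1.0},
    "Ash: 03": {"Sunday": 1.5, "Monday": 0.0, "Tuesday": 2.0, "Wednesday": 0.0},
}
THRESHOLD_M = 20.0

# Sum the daily mean distances for each collection object.
totals = {can: sum(days.values()) for can, days in mean_dist.items()}

# Flag every object whose summed distance exceeds the threshold.
flagged = sorted(can for can, total in totals.items() if total > THRESHOLD_M)
print(flagged)

# Share of flagged objects reported in this analysis: 6 of 42.
reported_share = 6 / 42
print(f"{reported_share:.0%}")
```

In this toy example only the first trash can crosses the threshold. With the real data, the same rule flags 6 of the 42 collection objects, or about 14%.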

But what about the number of items per cluster?

To get an idea of this, let’s take the exact same heatmap, but look at the count of items found near each collection object.

# prepping data for heatmap plot
plot_data <- clusters %>%
  mutate(
    closest_cent = paste(ifelse(cent_type == "trashCan", "Trash:",
                                ifelse(cent_type == "recyclingCan", "Recy:", "Ash:")),
                         ifelse(closest_cent<10, paste("0",closest_cent, sep=""), closest_cent),
                         sep= " ")
  ) %>% 
  subset(select = c(day, closest_cent, num_litter)) %>% 
  dcast(closest_cent ~ day, value.var = "num_litter") %>% 
  mutate( # changing na values to 0
    Sunday = ifelse(is.na(Sunday),0,Sunday)
    ,Monday = ifelse(is.na(Monday),0,Monday)
    ,Tuesday = ifelse(is.na(Tuesday),0,Tuesday)
    ,Wednesday = ifelse(is.na(Wednesday),0,Wednesday)
  )

# changing rownames to centroid id
rownames(plot_data) <- plot_data[,"closest_cent"]

# making dataframe into matrix
plot_data <- plot_data %>% 
  subset(select = -c(closest_cent)) %>% 
  as.matrix()

# colors for graph
colors <- brewer.pal(9, "RdPu")[c(1, 6:9)]
colors[1] <- "#ffffff"

# plotting heatmap
plot_data %>% 
  heatmaply(
          plot_method = "plotly"
          ,colors = colorRampPalette(colors)
          ,dendrogram = "both"
          ,show_dendrogram = c(FALSE, FALSE)
          ,label_names = c("Day", "Collection ID", "Number of Litter")
          ,grid_color = "white"
          ,main = "The count of Litter for each (Trash / Recycling / Ash) Can"
          #ylab = "Collection Objects (ID)",
          ,xlab = "A count of 0 means there are no objects around the Collection Object."
          ,key.title = "# of litter"
          ,showticklabels = c(TRUE, TRUE)
          ,column_text_angle = 0
          ,colorbar_len = .8
          ,grid_gap = 1
          ,Rowv = row_order
          ,Colv = clustered_data$day %>% unique()
          # ,cellnote = plot_data
          ) %>% 
  layout(width=800)

From this we can see that most collection objects have about 10-20 pieces of litter near them; however, some of them have well over 100. This could tell us a few things: the collection objects with higher counts had more traffic, they were fuller than others (meaning litter fell out of them), or people were lazier in those areas.

Most importantly, though, we really don’t know why this is the case. But we can try to see if there is a connection between the count of litter per cluster and the average distance of litter.

Is there a connection between the average distance and the amount of litter?

To get an idea, let’s look at some simple point plots, separated by day, as that’s when each of these clusters was cleaned.

From this, we initially see no correlations. There appears to be a weak positive correlation on Wednesday; however, nothing looks convincing. Nonetheless, let’s check out the Pearson correlation r values.

## [1] "Wednesday"
##            num_litter mean_dist
## num_litter  1.0000000 0.3944412
## mean_dist   0.3944412 1.0000000
## [1] "Tuesday"
##            num_litter mean_dist
## num_litter  1.0000000 0.1144265
## mean_dist   0.1144265 1.0000000
## [1] "Monday"
##            num_litter mean_dist
## num_litter  1.0000000 0.0523233
## mean_dist   0.0523233 1.0000000
## [1] "Sunday"
##            num_litter mean_dist
## num_litter  1.0000000 0.1606947
## mean_dist   0.1606947 1.0000000

Here we can see the correlation values for each day. Again, nothing is convincing, but we can see a weak correlation between the two variables on Wednesday (r ≈ 0.39). In general, this tells us that the amount of litter already around an area does not seem to influence a person’s decision to get their litter closer to the collection object. However, we don’t have enough data to confirm this.
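For readers who want to see what goes into an r value like the ones above, here is a minimal Python sketch of the Pearson correlation coefficient written out by hand. The two example lists are made up and are not the clusters from this analysis.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up example: litter count and mean distance for four clusters.
num_litter = [10, 20, 30, 40]
mean_dist = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(num_litter, mean_dist)
```

A value near 1 or -1 means the points fall close to a straight line, while a value near 0 (like most of the days above) means there is little linear relationship.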

Searchable data table of clusters

Here is a searchable data table to view the clusters yourself, along with some basic statistics for each!

In total we have 42 collection objects. Litter was, on average, 12.28 meters from its nearest collection object. From this we might guess that the average person drops their trash on the ground when they are a measly 12 meters away from the closest trash can. That is about 40 feet for us American folks. This suggests that event venues should place their trash collection objects at 40-foot intervals to maximize their effectiveness.
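The meters-to-feet figure above is a one-line conversion, shown here as a small Python sketch using the standard factor of roughly 3.28084 feet per meter.

```python
METERS_TO_FEET = 3.28084  # feet in one meter (approximate)

mean_distance_m = 12.28  # average distance reported above
mean_distance_ft = mean_distance_m * METERS_TO_FEET

print(f"{mean_distance_m} m is about {mean_distance_ft:.1f} ft")
```

This comes out to roughly 40.3 feet, which is where the 40-foot figure comes from.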

What does this all mean?

In summary, we found that 14% (6 of 42) of our collection objects had an average distance of more than 20 meters, suggesting that most collection objects were well placed to collect as much trash as possible. We also found that the average person drops their trash on the ground if a collection object is about 12 meters away.

Based on the clustering map, we can visually see that the clusters are grouped together well and none are excessively large. Unfortunately, not much could be gained from the time-series aspect of the litter produced over time, due to a clear (unintended) bias from the Rubbish team collecting litter in bursts, as opposed to consistently throughout the event.

We noted an extreme increase in paper-based litter collected during the first day of the event. This is most likely due to flyers being handed out to the event participants. In the future, things like this could be avoided by placing trash cans more strategically.

Finally, we also found that, statistically speaking, the amount of litter already around an area does not seem to influence a person’s decision to get their litter closer to the collection object.

In general, we found some pretty awesome insights. However, there are a few inherent biases due to how the data was collected, which severely limited the scope of the analysis. If these biases are addressed, much more could be done, such as finding the optimal collection object placements to minimize the amount of litter, hopefully finding more correlations that explain why people litter, designing a way to measure the efficacy of collection objects, finding the litter patterns of people during the event timeline, and lastly, helping keep the world a cleaner rock to live on!

My recommendations for litter data collection during events

To decrease time-based biases, litter should be collected consistently throughout the event’s day. This not only helps keep litter off the ground, but also helps show the trends in how litter is produced over time. This can help immensely in seeing where litter collection might be better focused. For example, if a conference is happening, would it be fruitful to move a recycling bin near the entrance so participants can discard their flyers on entrance or exit? Or would this prove useless? Would it be useful to move trash cans to the exits of the buildings before the events close? Questions like these could be answered with a full-scale time series.

To decrease bias in collection areas, try to spread out the collection team. Avoid having team members collect in the same areas at the same time. This will give a better picture of litter across the whole event space, as opposed to specific areas. This improvement could help distinguish which areas are truly high- or low-volume areas, as well as help relate foot traffic to the corresponding litter produced throughout the entire event. These insights could help venues focus cleanup efforts on specific areas when they know a high volume of litter will be produced.

Finally, to get a full picture of the trash collected during an event, data should be collected two days before the event and two days after the event. This would serve as a good baseline for how the event affects litter production in the short term. Also, the fullness of each collection object (e.g., a trash can) should be logged as a percentage at a given time. For example, at 2:00pm trash can #3 was 50% full. The team should also measure the amount of trash thrown into collection objects by participants. This can be done by physically weighing the trash from a collection object, which would be much easier than counting every object thrown away. These insights could help tremendously with finding ideal locations for the collection objects. Without knowing how full these objects were, everything is speculation. For example, a high-litter area could be due to large amounts of foot traffic or to a full trash can. These are drastically different circumstances and would be fixed in very different ways. This data could help answer questions such as: How impactful is the placement of trash cans? Are there correlations between how full a trash can is and how much litter is produced around it? Are there correlations between the average distance of litter to a trash can and how full it is? How impactful are dedicated tobacco ash cans, compared to a trash can?
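As a rough idea of what such a fullness log could look like, here is a minimal Python sketch of one log entry. The class name, field names, and the optional weight field are hypothetical; nothing like this exists in the Rubbish app or in this analysis.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class FillReading:
    """One hypothetical observation of how full a collection object is."""
    can_id: int
    can_type: str                      # e.g. "trashCan"
    observed_at: datetime
    percent_full: float                # 0 (empty) to 100 (overflowing)
    emptied_weight_kg: Optional[float] = None  # weight of trash when emptied, if known

    def __post_init__(self):
        # Reject readings outside the 0-100 percent range.
        if not 0 <= self.percent_full <= 100:
            raise ValueError("percent_full must be between 0 and 100")

# The example from the text: at 2:00pm, trash can #3 was 50% full.
reading = FillReading(
    can_id=3,
    can_type="trashCan",
    observed_at=datetime(2020, 2, 11, 14, 0),
    percent_full=50.0,
)
print(reading)
```

Logging readings like this alongside the litter points would make it possible to separate "full can" clusters from "high foot traffic" clusters.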

Extras

A fun Sunburst graph!

Finally, I will leave you with this fun sunburst graph to play with. You can hover over each section to further subset the clusters by type of objects surrounding the collection objects, then further by the amount of that type collected on each respective day. Have fun!






Below are the code chunks for the setup and the simple statistics used throughout this analysis. They are shoved to the bottom so as not to muddy up the written portion.

The head of the raw data

Number of litter collected for each day

Average distances to clusters for each day

 




A work by Alexander Kahanek x Rubbish, co.