tl;dr:

The final presentation.

synopsis:

Community detection and network analysis on a large body of emails. I created a dynamic web interface for finding interesting properties of various networks, using email as the data source. The emails came from Enron and Qualcomm, and several well-known test networks were used as well.

what i did:

In 2013 I earned a Master of Science in Computational Science from SDSU. For my culminating project I created a web interface for the analysis of a very large body of emails. One could type in a search term, like a person or a project name, and see the network and stats relevant to that search term.

why i did:

The project itself was a requirement for graduation. I chose it because I wanted something that would teach me something interesting and make good use of my experience and resources. The reason I was in the master's program is that I had been "all but thesis" for a master's in aerospace engineering. For a number of reasons that didn't work out, but I always wanted to finish what I had started.

how i did:

At the time I was working at Qualcomm and had access to many years' worth of emails that had been sent to various internal mailing lists. These were all kept as flat files in a hierarchy based on the name of the mailing list. As a proof of concept I parsed out the date, size, sender, receiver(s), and subject of each email and dumped that into MySQL. I created scripts in R and Python to do the network analysis and used shell to preprocess the intermediate files into a format the C++ code my advisor was interested in could read. The output of all this was put into JSON and fed to a d3 force-directed graph, datatables for the nodes and edges, and some other relevant network statistics.
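To give a feel for the parse-then-graph step, here is a minimal sketch in Python. It is not the original code: the sample messages, the function names (`extract_metadata`, `build_graph`), and the exact JSON shape (index-based `nodes`/`links`, a common input format for d3 force layouts) are all assumptions made for illustration, and the database step is skipped.

```python
import json
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical sample messages; the real inputs were flat files on disk.
SAMPLE_MESSAGES = [
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: kickoff\n"
    "Date: Mon, 1 Jan 2001 09:00:00 -0800\n"
    "\n"
    "body text that is never inspected",
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Subject: Re: kickoff\n"
    "\n"
    "more body text",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: follow-up\n"
    "\n"
    "even more body text",
]


def extract_metadata(raw_message):
    """Pull only header metadata out of one raw email; the body is ignored."""
    msg = message_from_string(raw_message)
    sender = getaddresses([msg.get("From", "")])
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "date": msg.get("Date"),
        "size": len(raw_message),  # characters, as a stand-in for bytes
        "sender": sender[0][1].lower() if sender else "",
        "receivers": [addr.lower() for _, addr in recipients if addr],
        "subject": msg.get("Subject", ""),
    }


def build_graph(records):
    """Count sender -> receiver pairs and shape them as nodes/links."""
    weights = Counter()
    for rec in records:
        for receiver in rec["receivers"]:
            weights[(rec["sender"], receiver)] += 1
    names = sorted({name for pair in weights for name in pair})
    index = {name: i for i, name in enumerate(names)}
    return {
        "nodes": [{"id": name} for name in names],
        "links": [
            {"source": index[s], "target": index[t], "value": w}
            for (s, t), w in sorted(weights.items())
        ],
    }


records = [extract_metadata(m) for m in SAMPLE_MESSAGES]
graph = build_graph(records)
graph_json = json.dumps(graph, indent=2)  # what a front end would load
```

Each link carries a `value` equal to the number of messages sent in that direction, which a front end could map to edge thickness.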

my take away:

You don't need a lot of data to figure out what's going on in a large network. Right around the time I was graduating, Snowden leaked what the NSA was doing. There were officials saying "all we collect is metadata, we don't listen in." I knew from my project that they had no need to listen in; all they had to do was know the phone number of one "bad guy" and the network would show itself from the metadata. Very powerful stuff. I was also pleased to note that in the original "Google/backrub" paper, Page and Brin say that they too only used metadata (the title of each page) to achieve "remarkable" results.
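A small sketch of why one known number is enough: given nothing but (caller, callee) pairs, a breadth-first search from the seed pulls out everyone within a few hops. The records, the phone numbers, and the function name `contacts_within` are made up for illustration; this is not code from the project.

```python
from collections import defaultdict, deque

# Hypothetical call metadata: who called whom, and nothing else.
CALLS = [
    ("555-0100", "555-0101"),
    ("555-0101", "555-0102"),
    ("555-0102", "555-0103"),
    ("555-0200", "555-0201"),  # unrelated pair, never reached from the seed
]


def contacts_within(records, seed, max_hops):
    """Return {number: hop count} for every number within max_hops of seed."""
    neighbors = defaultdict(set)
    for caller, callee in records:
        # Treat a call as an undirected tie between the two numbers.
        neighbors[caller].add(callee)
        neighbors[callee].add(caller)

    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand past the hop limit
        for other in neighbors[current]:
            if other not in hops:
                hops[other] = hops[current] + 1
                queue.append(other)
    return hops


network = contacts_within(CALLS, "555-0100", max_hops=2)
```

With these sample records, the two-hop network around the seed contains three numbers, and the unrelated pair stays out of it.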

and finally:

My thesis is offline right now. I suspect it's because the output of the R scripts has changed recently after upgrades to the libraries. Some day when I have time to figure out what's going on I'll try to fix it. Earlier this year I binge-watched Game of Thrones, then found a GoT network (everything exists on the internet) and plugged it into my program. It surprised me who some of the more central characters are.

bonus featurette:

Bonus - here's the video that SlideShare deleted. (btw, it looked awesome in the original presentation. It was high res, embedded in the slide, and you couldn't tell it was video until after I'd explained what was happening and pressed "play".)