Here’s how the Introduction to Data Science MOOC has been going.
We have been given the first lecture videos and the first assignment. So far the lecture information is fascinating and the first two parts of the assignment are something I could bash my way through. I have my handy Linode account that I can install things on as I like, and have done some Python programming before. There is a class Virtual Machine but I prefer trying what I already have set up on Linode.
The “part 0” of Assignment 1 involved writing a very short Python program to retrieve some data from the Twitter feed. This was very easy to do. And I used it to my amusement while watching Game of Thrones last night, changing the keyword search from “microsoft” to “gameofthrones”. I could observe other viewers’ reactions in real time. Quite a lot of rude tweets from the very start of the episode, which left me alternately amused and appalled.
Part 1 of Assignment 1 entailed getting a Twitter API account and setting that up, with some very thorough written and videoed instructions. Not too hard. Then we take our Twitter app keys, insert them into a pre-written Python program, and “pipe” the outputted Twitter stream into a file for 10 minutes. Finally, we take the first 20 lines of this output file, copy them into another file, and submit that as Assignment 1 Part 1.
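For the "first 20 lines" step, something like the following would do the job in Python. This is only a minimal sketch: the function name `copy_first_lines` and the file names in the usage note are made up for illustration, not taken from the course materials.

```python
from itertools import islice


def copy_first_lines(src_path, dst_path, n=20):
    """Copy the first n lines of src_path into dst_path.

    Returns the number of lines actually copied, which is less
    than n if the source file is shorter.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n lines, so the rest of the
        # (possibly very large) stream file is never read.
        for line in islice(src, n):
            dst.write(line)
            count += 1
    return count
```

Calling `copy_first_lines("output.txt", "first20.txt")` (hypothetical file names) would write the submission file. On a Linux box the shell command `head -n 20 output.txt > first20.txt` does the same thing in one line.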
Now the submission part got a little weird because every time I tried to submit, it showed that I got 0/5 points, with no feedback available. Jumping over to the forums, I found that the robo-grader was taking a while to post back results. I waited some time and, lo and behold, I got my 5/5 points. Next up is “sentiment analysis” where we take our big Twitter output file and run it against some kind of sentiment file.
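I have not seen the sentiment file yet, so here is only a guess at what "running tweets against a sentiment file" might look like. The sketch assumes two things that are not confirmed by the assignment: that the sentiment file has one `term<TAB>integer score` pair per line, and that each line of the Twitter output file is a JSON object with a `text` field. The function names are hypothetical.

```python
import json


def load_scores(lines):
    """Build a {term: score} dict from 'term<TAB>score' lines (assumed format)."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        term, value = line.rsplit("\t", 1)
        scores[term] = int(value)
    return scores


def tweet_sentiment(text, scores):
    """Sum the scores of the words in a tweet; unknown words count as 0."""
    return sum(scores.get(word, 0) for word in text.lower().split())


def score_stream(lines, scores):
    """Score every line of the stream that parses as a tweet with a 'text' field."""
    results = []
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip blank or malformed lines
        if isinstance(tweet, dict) and "text" in tweet:
            results.append(tweet_sentiment(tweet["text"], scores))
    return results
```

A scorer like this reads every word literally and splits only on whitespace, so punctuation stuck to a word hides it from the lookup, and multi-word terms are never matched. It is a starting point for experimenting, not a finished analysis.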
I have done a little screen-scraping with Python and saw how easy it is to grab content from a web page and use it for my own purposes. Now I see how easy it is to grab content from Twitter and, again, bend it to my own purposes. This leads to simultaneous exclamations of “Oh the privacy!” versus “I have the power!” Imagine the kind of snooping, market analysis and prediction you can do with Twitter data.
This leads me to wonder how data analysis can cope with shills, trolls, spammers, memes, and sarcasm/parody on the Internet. For instance on certain comment sites, insular memes may become popular — the regulars understand the memes while outsiders may be mistakenly outraged or simply confused. But sometimes it can be very difficult even for regulars to distinguish a troll. How does a sentiment analysis program take all the irregularities into account?