Where have I been?

I had a pretty nasty case of laryngitis, and have also been very busy with the Introduction to Data Science class. The MOOC has been difficult. You really need to stay motivated to keep up with a class like this, which ranges from fascinating high-level concepts to nitty-gritty relational algebra and Python. I wish I’d had more SQL experience before embarking on this endeavor. Still, I’m hanging in there.

At any rate, the MOOC has been taking up my extra energies lately. When the class is over, I hope to post more here.


Friday amusement: the many arms of EBSCO Industries

(reposted from an email to the Collection Development list at work)

So I was thinking about how there’s EBSCO the subscription agent and the EBSCOhost databases, and how it gets confusing, so I should clarify that. EBSCO Industries is a large company which owns many children. Or let’s say, it’s a multi-tentacled beast. The EBSCO subscription agent is one arm and the EBSCOhost databases comprise another, although they must sit next to each other on the corporate body.

Then I wondered: what other products and services emerge from the other EBSCO tentacles? According to this web page, http://www.ebscoind.com (under GROUPS), here are just the amusing ones I found:

So … invest in EBSCO Industries because no matter how the market turns, something has to be profitable??

Introduction to Data Science MOOC on Coursera: first week thoughts

Here’s how the Introduction to Data Science MOOC has been going.

We have been given the first lecture videos and the first assignment. So far the lecture information is fascinating, and the first two parts of the assignment are something I could bash my way through. I have my handy Linode account that I can install things on as I like, and I have done some Python programming before. There is a class virtual machine, but I prefer trying what I already have set up on Linode.

The “part 0” of Assignment 1 involved writing a very short Python program to retrieve some data from the Twitter feed. This was very easy to do.  And I used it to my amusement while watching Game of Thrones last night, changing the keyword search from “microsoft” to “gameofthrones”. I could observe other viewers’ reactions in real time. Quite a lot of rude tweets from the very start of the episode, which left me alternately amused and appalled.

Part 1 of Assignment 1 entailed getting a Twitter API account and setting that up, with some very thorough written and videoed instructions. Not too hard. Then we take our Twitter app keys, insert them into a pre-written Python program, and “pipe” the outputted Twitter stream into a file for 10 minutes. Finally, we take the first 20 lines of this output file, copy them into another file, and submit that as Assignment 1 Part 1.
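For the curious, the "first 20 lines" step can be done in a few lines of Python instead of by hand. This is only a rough sketch of how I might do it, not anything the class handed out; the function name and the file names in the usage note are made up for illustration.

```python
from itertools import islice


def copy_first_lines(src_path, dst_path, n=20):
    """Copy the first n lines of src_path into dst_path.

    Returns the number of lines actually copied, which is smaller
    than n if the source file is shorter than n lines.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n lines, so a huge stream file is never fully read.
        lines = list(islice(src, n))
        dst.writelines(lines)
    return len(lines)
```

Calling something like `copy_first_lines("output.txt", "first20.txt")` (both file names hypothetical) would leave the big stream file untouched and write the short one next to it.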

Now the submission part got a little weird, because every time I tried to submit, it showed that I got 0/5 points, with no feedback available. Jumping over to the forums, I found that the robo-grader was taking a while to post back results. I waited some time and, lo and behold, I got my 5/5 points. Next up is “sentiment analysis,” where we take our big Twitter output file and run it against some kind of sentiment file.
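I haven't started that part yet, so what follows is just my guess at what "running tweets against a sentiment file" could look like. Big assumptions: the sentiment file has one term and one integer score per line, separated by a tab, and each line of the Twitter output is a JSON object with a "text" field. The function names are mine, not the assignment's.

```python
import json


def load_sentiment_scores(lines):
    """Build a {term: score} dict from 'term<TAB>score' lines (assumed layout)."""
    scores = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        term, _, score = line.rpartition("\t")
        scores[term] = int(score)
    return scores


def tweet_sentiment(raw_line, scores):
    """Add up the scores of the known words in one line of stream output.

    Lines that are not valid JSON, or that carry no "text" field,
    score 0 rather than raising an error.
    """
    try:
        tweet = json.loads(raw_line)
    except ValueError:
        return 0
    text = tweet.get("text", "") if isinstance(tweet, dict) else ""
    # Naive tokenizing: lowercase and split on whitespace only.
    return sum(scores.get(word, 0) for word in text.lower().split())
```

A word-by-word sum like this is about as crude as it gets: it knows nothing about punctuation, multi-word phrases, or tone.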

I have done a little screen-scraping with Python and saw how easy it is to grab content from a web page and use it for my own purposes. Now I see how easy it is to grab content from Twitter and, again, bend it to my own purposes. This leads to simultaneous exclamations of “Oh the privacy!” and “I have the power!” Imagine the kind of snooping, market analysis and prediction you can do with Twitter data.

This leads me to wonder how data analysis can cope with shills, trolls, spammers, memes, and sarcasm/parody on the Internet. For instance, on certain comment sites, insular memes may become popular: the regulars understand the memes, while outsiders may be mistakenly outraged or simply confused. But sometimes it can be very difficult even for regulars to spot a troll. How does a sentiment analysis program take all these irregularities into account?

MOOC time: Introduction to Data Science

I signed up for Introduction to Data Science through Coursera; it just started yesterday. Bill Howe from the University of Washington is the instructor.

After watching the first 5 video lectures, I was intrigued and excited. This morning, I had some sobering thoughts. The subject matter doesn’t apply directly to my job. This class is covering Big Data, the kind that must be processed on multiple machines. I will be doing this class on my own time.

Will I continue taking this class? I’ll try it a while longer. So far, Professor Howe is a good speaker and makes good use of interesting real-world examples (Nate Silver, the presidential campaign, etc). I am able to comprehend the content. If nothing else, this gives me the hope that I won’t be just another uninformed consumer of the Internet; rather I’ll have some grasp of what’s going on behind the scenes.

Some other ulterior motives: I can talk about this with my Senior Software Architect boyfriend, who paid attention to some of the lecture material and found it interesting. And I can pretend that some day when I grow up, I will be a Data Science Analyst and make oodles of money.

…and as I went in to look at the course, Coursera appears to have gone down with a 500 error. Oops!