Where have I been?

I had a pretty nasty case of laryngitis, and I have also been very busy with the Introduction to Data Science class. The MOOC has been difficult. You really need to stay motivated to keep up with a class like this, which ranges from fascinating high-level concepts to nitty-gritty relational algebra and Python. I wish I had had more SQL experience before embarking on this endeavor. Still, I’m hanging in there.

At any rate, the MOOC has been taking up my extra energies lately. When the class is over, I hope to post more here.

Introduction to Data Science MOOC on Coursera: first week thoughts

Here’s how the Introduction to Data Science MOOC has been going.

We have been given the first lecture videos and the first assignment.  So far the lecture information is fascinating and the first two parts of the assignment are something I could bash my way through. I have my handy Linode account that I can install things on as I like, and have done some Python programming before. There is a class Virtual Machine but I prefer trying what I already have set up on Linode.

“Part 0” of Assignment 1 involved writing a very short Python program to retrieve some data from the Twitter feed. This was very easy to do. I also used it for my own amusement while watching Game of Thrones last night, changing the keyword search from “microsoft” to “gameofthrones”. I could observe other viewers’ reactions in real time. There were quite a lot of rude tweets from the very start of the episode, which left me alternately amused and appalled.

Part 1 of Assignment 1 entailed getting a Twitter API account and setting it up, with some very thorough written and videoed instructions. Not too hard. Then we take our Twitter app keys, insert them into a pre-written Python program, and “pipe” the outputted Twitter stream into a file for 10 minutes. Finally, we take the first 20 lines of this output file, copy them into another file, and submit that as Assignment 1 Part 1.
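
The "first 20 lines" step is easy to do by hand, but here is a minimal Python sketch of it. The function name and any file names you pass in are my own placeholders, not the ones the course uses.

```python
from itertools import islice


def copy_first_lines(src_path, dst_path, n=20):
    """Copy the first n lines of src_path into dst_path.

    Returns the number of lines actually written, which is smaller
    than n when the source file is shorter.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n lines, so a large stream dump is never
        # read into memory in full.
        lines = list(islice(src, n))
        dst.writelines(lines)
    return len(lines)
```

Calling `copy_first_lines("stream_output.txt", "first_20.txt")` (both names hypothetical) would leave the original stream file untouched and write the shorter file next to it.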

Now the submission part got a little weird: every time I tried to submit, it showed that I got 0/5 points, with no feedback available. Jumping over to the forums, I found that the robo-grader was taking a while to post back results. I waited some time and, lo and behold, I got my 5/5 points. Next up is “sentiment analysis” where we take our big Twitter output file and run it against some kind of sentiment file.

I have done a little screen-scraping with Python and saw how easy it is to grab content from a web page and use it for my own purposes. Now I see how easy it is to grab content from Twitter and, again, bend it to my own purposes. This leads to simultaneous exclamations of “Oh the privacy!” and “I have the power!” Imagine the kind of snooping, market analysis and prediction you can do with Twitter data.

This leads me to wonder how data analysis can cope with shills, trolls, spammers, memes, and sarcasm/parody on the Internet.  For instance on certain comment sites, insular memes may become popular — the regulars understand the memes while outsiders may be mistakenly outraged or simply confused. But sometimes it can be very difficult even for regulars to distinguish a troll. How does a sentiment analysis program take all the irregularities into account?
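
To make the question concrete, here is a rough sketch of the simplest kind of sentiment scoring I can imagine: look up each word of a tweet in a table of scores and add them up. This is my own guess at the general idea, not the course's method, and the term scores below are invented for illustration.

```python
import re

# Hypothetical term scores, made up for this example. A real sentiment
# file would list far more terms.
TERM_SCORES = {
    "love": 3,
    "great": 3,
    "amused": 2,
    "hate": -3,
    "awful": -3,
    "appalled": -2,
}


def tweet_sentiment(text, scores=TERM_SCORES):
    """Sum the scores of the known words in a tweet; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(scores.get(word, 0) for word in words)
```

With these made-up scores, a sincere "I love this episode, great writing" and a sarcastic "Oh great, another cliffhanger. I just love waiting a year." get exactly the same positive score, because both contain "love" and "great". A plain word lookup has no notion of tone, in-jokes or trolling, which is the problem I am wondering about.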

MOOC time: Introduction to Data Science

I signed up for Introduction to Data Science through Coursera; it just started yesterday. Bill Howe from University of Washington is the instructor.

After watching the first 5 video lectures, I was intrigued and excited. This morning, I had some sobering thoughts. The subject matter doesn’t apply directly to my job. This class is covering Big Data, the kind that must be processed on multiple machines. I will be doing this class on my own time.

Will I continue taking this class? I’ll try it a while longer. So far, Professor Howe is a good speaker and makes good use of interesting real-world examples (Nate Silver, the presidential campaign, etc). I am able to comprehend the content. If nothing else, this gives me the hope that I won’t be just another uninformed consumer of the Internet; rather I’ll have some grasp of what’s going on behind the scenes.

Some other ulterior motives: I can talk about this with my Senior Software Architect boyfriend, who paid attention to some of the lecture material and found it interesting. And I can pretend that some day when I grow up, I will be a Data Science Analyst and make oodles of money.

…and as I went in to look at the course, Coursera appears to have gone down with a 500 error. Oops!

VM Hosting with Linode

Disclaimer: This is not an ad – I don’t own stock in Linode and I’m not getting a discount.

I wanted to have my own full-featured VM (Virtual Machine) to muck around with. After a misstep or two, I decided to try Linode (short for Linux Node) and I’m very happy with it. You have your very own Linux machine to play with; in fact, you can pick from a large selection of Linux distributions. The documentation is great and so is the technical support. I can use this as a portable Python machine for work (since I can use it from the Reference Desk via PuTTY), for learning, and also for personal stuff like mucking about in Drupal. At $19.95 a month for the basic setup, Linode is a pretty good deal.

If you have ambitions of becoming a Systems Librarian some day, you really ought to be familiar with Linux, UNIX or similar systems. Part of being a Systems Librarian is being a SysAdmin, and the Systems tend to be of the UNIX-flavored variety.

Procrastination

One of my lifelong, serious faults is procrastination. I should qualify that by adding that I mostly get the important stuff done on time, but often wait until the last minute. I pay my bills on time, authorize time sheets before the deadline, etc. Still, it seems like it would be better to be super-organized and not feel anxiety about last-minute tasks.

Here’s a Lifehacker post on procrastination. I like the Pomodoro technique, where you set a timer and work until time is up, then take a break as a reward. I think the real point is that the timer distracts you from anxiety and pressure, allowing you to get some work done as you get past emotional blockages or whatever’s obstructing you.

The Lifehacker article mentions apps for blocking distracting websites like Facebook, which can serve as easy ways to procrastinate. I’m not a Facebook addict so that wouldn’t work for me. However I must admit the constant email notifications can really throw me. Especially when I’m transitioning to another task and an email pops up that I want to reply to — after I hit send, I’m left with that “What was I doing?” feeling. Maybe I should turn off email notification in Outlook?

This Chronicle of Higher Education article on structured procrastination is also interesting. The author recommends setting up some long-term task as the one to concentrate procrastination on, so the shorter-term jobs get done.

I really should try to implement one or more of these strategies but I have to check my email …

Trying to understand RDA and FRBR

Library folks, including myself, have been attending group Webinars from Lyrasis on RDA. The latest one was this morning.

Firstly, the classes are held at such a time that, because of time zone issues, they usually start at 8 AM here in Colorado. Some of the organizing librarians have graciously provided bagels and coffee. Hey Lyrasis, you’ll be in trouble if you start expanding into the Pacific time zone!

Secondly, I am finding RDA rather confusing. Back when I was in library school and RDA was still on the distant horizon, it seemed like it would be a radical change from AACR2. But now it mostly feels like a lot of tweaking to AACR2. And some RDA chapters aren’t even written yet! We wondered if philosophical verbiage was just being larded onto the traditional MARC structure.

Maybe the change isn’t that radical because we’re still tied to MARC format? What will happen with the proposed Bibliographic Framework?

Another point of confusion is the FRBR terminology of Entity, Work, Expression, Manifestation and Item. The teacher for this Webinar series, Christee Pascale, is a good speaker, but she breezes through these terms and after a while it all starts flying over our heads.

Here’s a helpful LIS Wiki page on the subject:

http://liswiki.org/wiki/FRBR

Here are a couple of possibly helpful resources from Library of Congress:

What is FRBR?  A conceptual model for the bibliographic universe / Barbara Tillett

FRBR Quiz 1: FRBR Terminology

If you really, really want to know more about the background leading up to FRBR, here’s a book chapter on the history of cataloging. “FRBR and the History of Cataloging” / William Denton from Understanding FRBR: What it is and How It Will Affect Our Retrieval Tools. Ed. by Arlene G. Taylor. Libraries Unlimited. 2007.

Excel file import tricks for the file-weary

If you work a lot with library title lists in Excel, you may have noticed a couple of problems in the course of using the Import Wizard:

  • Diacritics get mangled, requiring multiple “find and replace” runs to fix the mess after the fact. Unless, of course, you remember to switch from ANSI to UTF-8 on the first screen of the Import Wizard.
  • Long ID numbers get turned into scientific notation, losing their uniqueness and thus their value. Also ISSN numbers beginning with “0” may lose their leading zero. A holding statement of (1990) may translate into -1990. It’s a real pain to click on each column and select Text individually.

It turns out you can at least set Excel to default to UTF-8 (Unicode), and there’s a quicker way to turn all your columns into text format, which stops Excel from mangling number-like strings that aren’t really just plain integers.


Set the default character coding to UTF-8: (this worked for me on Excel 2007. For files where you get gobbledygook in UTF-8, change it back to ANSI).

STUFF: MISCELLANEOUS TOPICS ALL VAGUELY RELATED TO SCIENCE by Luke Miller. Change default text import origin type in Excel.

(Not for the faint of heart if you’re afraid to change registry values!)


Quick way to turn all columns into text while in the Import Wizard:

Allen Wyatt’s EXCELTIPS (Ribbon Interface). Faster Text File Conversions.

Basically, when you’re on the column screen, click on the first column while holding shift, scroll the columns all the way to the right so all the columns are highlighted, then select Text as the column data format.  If you have some “real numbers” you can always change them back later.
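
If you would rather check a title list outside of Excel first, here is a minimal Python sketch of the same two ideas: read the file as UTF-8 and keep every column as text. It assumes a tab-delimited file with a header row. The function name and the column names you would look up are my own placeholders.

```python
import csv


def read_title_list(path, delimiter="\t", encoding="utf-8-sig"):
    """Read a delimited title list, keeping every field as a plain string.

    The csv module does not guess at data types, so leading zeros, long
    ID numbers and values like "(1990)" come through exactly as written.
    The "utf-8-sig" codec reads UTF-8 with or without a byte order mark.
    """
    with open(path, newline="", encoding=encoding) as infile:
        return list(csv.DictReader(infile, delimiter=delimiter))
```

Each row comes back as a dictionary keyed by the header row, so `row["ISSN"]` (assuming a column with that heading) is still the string you started with. As with the Import Wizard, a file that turns to gobbledygook under UTF-8 probably needs a different encoding passed in.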

“Introduction to Database Concepts” part 1: The Four Steps of Database Design

In my previous post, I mentioned why it’s a good idea to freshen up on database concepts and recommended the first chapter of a Springer ebook, Pro SQL Server 2008 Relational Database Design and Implementation.  Louis Davidson,  Kevin Kline,  Scott Klein,  Kurt Windisch. 2008. Springer. ISBN: 978-1-4302-0866-2 (Print) 978-1-4302-0867-9 (Online).  So far it’s a very readable book with great real-world examples and a sprinkling of humor.

Chapter 1: Introduction to Database Concepts.

It’s recommended that one design a database with a proper understanding of the underlying concepts, despite urgent calls to come up with a usable product right away.

Database Design Phases

Consider future data needs. Plan carefully at the outset to avoid difficult changes and maintenance down the road.

Steps of database design process as defined in this book:

  1. Conceptual
    The conceptual phase is the high-level discovery and analysis of the entities and business rules involved with the data under consideration. This is where you determine your high-level data requirements.  Note that no tables or other practical database structure is designed in this phase.

    • Entities are the “People, places, and things” (p.4) required by the design. They are usually nouns. In the library context entities might be users, library locations and items in the catalog.
    • Business rules can be thought of as processes and rules of operation, such as the definition of specific user groups and their renewal periods, or the format of the ISBN.  Business rules can be very detailed or very broad, or anywhere in between. Conditional rules are possible; these may have to be implemented outside of the database itself.
  2. Logical
    The logical phase is where the designs from the conceptual phase become a blueprint. There are still no detailed tables or any practical database design in this phase. The following are defined in this phase:

    • Entities. “People, places and things”. Examples: patron, book, library branch.
    • Attributes for each entity. Attributes are qualities of the entity. Examples: patron name, patron type, patron ID.
    • Normalization (explained later in book)
    • Candidate keys. Candidate keys are potentially unique attributes of instances of entities. Keys are how the database accesses specific instances. Examples: patron ID, ISBN number.
    • Relationships and cardinalities (explained later in book)
    • Domains. The datatype for each attribute, and whether the data must be present or NULL is acceptable.
  3. Implementation

    This is where the rubber hits the road: the information from the previous phase is applied to a platform-specific database with data types, tables, constraints, and triggers. One should not stray far from the original design, data and integrity constraints that were defined in previous phases. Database security is an essential part of implementation. DW notes that most relational databases are used with SQL, which is vulnerable to SQL-injection attacks. Do not allow for damaging or embarrassing hacks by leaving the back door open.

  4. Physical

    The so-called physical phase is where consideration is given to physical storage methods and for performance-tuning. In the case of VMs (Virtual Machines) and the Cloud, one may not worry too much about physical storage space, but performance may still be an issue.
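
To see how the logical design might carry over into the implementation phase, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for the SQL Server platform the book covers, and the table, column names and patron types are my own illustrations based on the patron examples above, not taken from the book.

```python
import sqlite3


def create_library_db():
    """Build an in-memory database holding a single patron table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE patron (
            patron_id   TEXT NOT NULL PRIMARY KEY,  -- candidate key used as the key
            patron_name TEXT NOT NULL,              -- domain: text, must be present
            patron_type TEXT NOT NULL               -- business rule: known user groups
                CHECK (patron_type IN ('student', 'faculty', 'staff')),
            email       TEXT                        -- domain: text, NULL is acceptable
        )
        """
    )
    return conn


def add_patron(conn, patron_id, name, patron_type, email=None):
    """Insert one patron; the constraints reject rows that break the design."""
    conn.execute(
        "INSERT INTO patron (patron_id, patron_name, patron_type, email) "
        "VALUES (?, ?, ?, ?)",
        (patron_id, name, patron_type, email),
    )


def find_patron(conn, patron_id):
    """Look up a patron by ID.

    The ? placeholder passes the value as data, so it is never spliced
    into the SQL text. This is the usual guard against SQL injection.
    """
    cursor = conn.execute(
        "SELECT patron_name, patron_type FROM patron WHERE patron_id = ?",
        (patron_id,),
    )
    return cursor.fetchall()
```

After `add_patron(conn, "P001", "Ada Reader", "faculty")`, looking up `"P001"` returns that one row, while a hostile value such as `"P001' OR '1'='1"` returns nothing because it is compared as an ordinary string. Trying to add a patron with no name raises `sqlite3.IntegrityError`, which is the NOT NULL part of the domain definition doing its job.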

Learning about relational databases

In the course of looking for information on data cleaning and data structure, and not getting much out of the Saylor.org Introduction to Database Management course, I found that the first chapter of this Springer ebook looked promising.

Pro SQL Server 2008 Relational Database Design and Implementation. Louis Davidson, Kevin Kline, Kurt Windisch. Springer, 2008. ISBN: 978-1-4302-0866-2 (Print) 978-1-4302-0867-9 (Online).

https://i2.wp.com/0-link.springer.com.tiger.coloradocollege.edu/static-content/0.6406/covers/books/280/9781430208662.jpg

You’re probably asking yourself, why should an ordinary non-Systems librarian concern herself with databases? And how can this software-specific book from 2008 help?

I’ve been gradually coming to the conclusion that librarians, at least ones involved with technical services and electronic resources, should have some passing knowledge of data and how it is commonly structured for the following reasons:

  1. Understand the underlying workings of the ILS, of most Web pages and apps, software in general.
  2. Structure data and accompanying workflows in a mindful and enlightened fashion.
  3. Better understand and communicate with I.T. folks.

Number 2 is foremost in my mind lately, pondering how to flesh out consistent, efficient, updated workflows. Also, if I want to delve more into Python again, could I make use of a relational database to, say, manage SUSHI input?  Even though we may all be heading down the path to nonrelational datastores (NoSQL) like Redis, it is useful to examine how data is traditionally structured. The higher level aspect of design is especially interesting to me.

As to this particular book, Chapter 1: Introduction to Database Concepts is just what it says, and a good one at that. In my next post, I will summarize the first part of the chapter.

Self-education for free (or cheap)

Feel like you need to brush up on old skills, or learn completely new areas? In this era, librarians have many tools for self-education. Whether you want to learn more about a subject area you liaise to, learn a programming language, understand what your colleagues are talking about, or pick up computing concepts that would help you do your job better, the following may be helpful (and may even lead to more lucrative positions or promotions in the future).

  • Books and ebooks through your library
    • Don’t discount the availability of learning materials that you can access right now through your place of employment (print books and ebooks) or public library. I have found a great source in our Springer ebooks collection, and also the “Very Short Introduction to …” print book series from Oxford. Interlibrary Loan is always an option.
    • One caveat: computer-related books tend to go out of date quickly as new versions of software are released, so try to get the most recent editions and be prepared for occasional glitches when something doesn’t work quite right because you’re using a different version than the author(s) did.
  • MOOCs and other online learning systems
    • MOOCs (Massive Open Online Courses) by Coursera, Saylor.org and many others are providing classes that are either ongoing scheduled classes or self-directed learning. These are typically free and unaccredited options that you can try without much investment or consequence if you don’t like a particular class. I initially signed up for a Coursera class on Algorithms but quickly found it was too difficult and out of my league (and too theoretical). I am about to sign up for a Saylor class on databases — if it’s too advanced, I’ll opt out. No harm, no foul, no money lost.
    • For self-directed online learning on computer-related topics (without the homework or exams), try w3schools.com and Codecademy.
  • Take a “real” class.
    Does your institution allow you to take classes for free or cheap? What about local options? What about departments you liaise to — can you sit in on a professor’s class for one semester? (Clear that with your boss first!)
  • Rampant Googling and Wiki-ing.
    There is no reason to be confused about terms or concepts when you can pop them into Google or Wikipedia and Read All About It. When in conversation with people who are talking in technical terms that I don’t understand, I try to make written or mental notes and look up the mysterious stuff later. For my learning style, this can help me gain at least superficial understanding and create a foundation for deeper learning later on. [Use your information literacy powers to determine if you’re reading credible sources.]
  • Educational videos.
    Khan Academy and others have many free educational videos on the web, so why not take advantage of them? The ubiquity of comments helps with information literacy concerns, as you will note if somebody has left out something of import. Incidentally, if you need to learn anything “mundane”, from how to crochet to how to chop an onion efficiently, YouTube probably has several dozen videos on the subject.