AutoHotkey macros — joy and frustration

Once again I find myself using AutoHotkey to create macros for tedious cleanup work that can’t be done in an easy, automated fashion from within the ILS. Yes, you too can magically copy and paste fields if they appear in regular spaces.

And once again AHK is driving me nuts because it works sporadically and I can’t figure out why!! Hit the hotkey sequence once — it works. Move to the next record, either with cursor or mouse, and it doesn’t. Change the record manually. Then enter a third record. AHK works again.

Yes, this is used liberally in my macros:

Sleep, 100

I encountered this problem before, when we had a local server. The need for high sleep cycles has probably been worsened by having a hosted server with a slow connection for the ERM coverage info.

Anyway, if you have repetitive tasks that your ILS can’t manage, you may want to AHK a try, even if it’s not perfect. It will give your wrists a break.AHK

OpenRefine for cleaning data

On the “Introduction to Data Science” discussion forum, somebody mentioned GoogleRefine, now called OpenRefine. It’s a free desktop application to help clean up data; it also works with Freebase. I have only played around with OpenRefine a little but some might find this a better way to clean data than with Excel. More tools in the toolbox can’t hurt.

Excel file import tricks for the file-weary

If you work a lot with library title lists in Excel, you may have noticed a couple problems in the course of using the Import Wizard:

  • Diacritics get mangled, requiring multiple “find and replace” runs to fix the mess after the fact. Unless, of course, you remember to switch from ANSI to UTF-8 on the first screen of the Import Wizard.
  • Long ID numbers get turned into scientific notation, losing their uniqueness and thus their value. Also ISSN numbers beginning with “0” may lose their leading zero. A holding statement of (1990) may translate into -1990. It’s a real pain to click on each column and select Text individually.

It turns out you can at least set Excel to default to UTF-8 (Unicode) and there’s a quicker way to turn all your columns into text format, eliminating the annoying treatment that Excel does to number-like strings that aren’t really just plain integers.

Set the default character coding to UTF-8: (this worked for me on Excel 2007. For files where you get gobbledygook in UTF-8, change it back to ANSI).

STUFF: MISCELLANEOUS TOPICS ALL VAGUELY RELATED TO SCIENCE by Luke Miller. Change default text import origin type in Excel.

(Not for the faint of heart if you’re afraid to change registry values!)

Quick way to turn all columns into text while in the Import Wizard:

Allen Wyatt’s EXCELTIPS (Ribbon Interface). Faster Text File Conversions.

Basically, when you’re on the column screen, click on the first column while holding shift, scroll the columns all the way to the right so all the columns are highlighted, then select Text as the column data format.  If you have some “real numbers” you can always change them back later.

“Introduction to Database Concepts” part 1: The Four Steps of Database Design

In my previous post, I mentioned why it’s a good idea to freshen up on database concepts and recommended the first chapter of a Springer ebook, Pro SQL Server 2008 Relational Database Design and Implementation.  Louis Davidson,  Kevin Kline,  Scott Klein,  Kurt Windisch. 2008. Springer. ISBN: 978-1-4302-0866-2 (Print) 978-1-4302-0867-9 (Online).  So far it’s a very readable book with great real-world examples and a sprinkling of humor.

Chapter 1: Introduction to Database Concepts.

It’s recommended that one design a database with a proper understanding of the underlying concepts, despite urgent calls to come up with a usable product right away.

Database Design Phases

Consider future data needs. Plan carefully at the outset to avoid difficult changes and maintenance down the road.

Steps of database design process as defined in this book:

  1. Conceptual
    The conceptual phase is the high-level discovery and analysis of the entities and business rules involved with the data under consideration. This is where you determine your high-level data requirements.  Note that no tables or other practical database structure is designed in this phase.

    • Entities are the “People, places, and things” (p.4) required by the design. They are usually nouns. In the library context entities might be users, library locations and items in the catalog.
    • Business rules can be thought of as processes and rules of operation, such as the definition of specific user groups and their renewal periods, or the format of the ISBN.  Business rules can be very detailed or very broad, or anywhere in between. Conditional rules are possible; these may have to be implemented outside of the database itself.
  2. Logical
    The logical phase is where the designs from the conceptual phase become a blueprint. There is still no detailed tables or any practical database design in this phase.The following are defined in this phase:

    • Entities. “People, places and things”. Examples: patron, book, library branch.
    • Attributes for each entity. Attributes are qualities of the entity. Examples: patron name, patron type, patron ID.
    • Normalization (explained later in book)
    • Candidate keys. Candidate keys are potentially unique attributes of instances of entities. Keys are how the database accesses specific instances. Examples: patron ID, ISBN number.
    • Relationships and cardinalities (explained later in book)
    • Domains. The datatype for each attribute, and whether the data must be present or NULL is acceptable.
  3. Implementation

    This is where the rubber hits the road — the information from the previous phase is applied to a platform-specific database with data types, tables, constraints, and triggers. One should not stray far from the original design, data and integrity constraints that were defined in previous phases.Database security is an essential part of implementation. DW notes that most relational databases are used with SQL, which is vulnerable to SQL-injection attacks. Do not allow for damaging or embarrassing hacks by leaving the back door open.

  4. Physical

    The so-called physical phase is where consideration is given to physical storage methods and for performance-tuning. In the case of VMs (Virtual Machines) and the Cloud, one may not worry too much about physical storage space, but performance may still be an issue.

Learning about relational databases

In the course of looking for information on data cleaning and data structure, and not getting much out of the Introduction to Database Management course, I found that the first chapter of this Springer ebook looked promising.

Pro SQL Server 2008 Relational Database Design and Implementation. Louis Davidson, Kevin Kline, Kurt Windisch. Springer, 2008. ISBN: 978-1-4302-0866-2 (Print) 978-1-4302-0867-9 (Online).

You’re probably asking yourself, why should an ordinary non-Systems librarian concern herself with databases? And how can this software-specific book from 2008 help?

I’ve been gradually coming to the conclusion that librarians, at least ones involved with technical services and electronic resources, should have some passing knowledge of data and how it is commonly structured for the following reasons:

  1. Understand the underlying workings of the ILS, of most Web pages and apps, software in general.
  2. Structure data and accompanying workflows in a mindful and enlightened fashion.
  3. Better understand and communicate with I.T. folks.

Number 2 is foremost in my mind lately, pondering how to flesh out consistent, efficient, updated workflows. Also, if I want to delve more into Python again, could I make use of a relational database to, say, manage SUSHI input?  Even though we may all be heading down the path to nonrelational datastores (NoSQL) like Redis, it is useful to examine how data is traditionally structured. The higher level aspect of design is especially interesting to me.

As to this particular book, Chapter 1: Introduction to Database Concepts is just what it says, and a good one at that. In my next post, I will summarize the first part of the chapter.

New York Times article on dubious Open Access journals and fake conferences

Scientific Articles Accepted (Personal Checks, Too)

Gina Kolata. April 7, 2013.

” … some researchers are now raising the alarm about what they see as the proliferation of online journals that will print seemingly anything for a fee. They warn that nonexperts doing online research will have trouble distinguishing credible research from junk …”

See Jeff Beall’s Scholarly Open Access blog for more info.

The rising cost of journals and journal packages

Here’s a common woe of academic librarians.  Many of our large journal packages increase in price from 8-11% a year.  There’s similar pressure on the individual journal subscriptions, especially the science ones.

When library budgets are flat, declining, or only increasing by a few percent per year, how are libraries going to sustain these increases? Are we supposed to raid the book budget to pay for an outrageous 25% increase on a suite of psychology journals?  Is the college really going to sell the climbing wall to allow “unnamed university press” to tack on exorbitant new journals we never asked for?

Here’s a chart I ran in Excel, assuming a starting cost of $2,000,000 for library materials budget, $900,000 for electronic resources.  Library budget increases at 2% and electronic resources increasing at a very modest 5% over 20 years.


Notice how the electronic resources overtake the library’s material budget – gradually at first, then quickly gobbling up the overall budget.

At a liberal arts college, there’s still some commitment to the book — as a book in any format, and the book as printed word.  We are not about to eliminate the vast majority of our books so the journal publishers can profit more.

In the past few years I’ve been hearing that big publishers are snatching up journals left and right, leading to a feeding frenzy and price inflation.  Well the sharks better settle down because we just. Cannot. Sustain. 8% increases.