One of the biggest repetitive headaches in my job is matching and manipulating title lists from various sources such as the ILS, the link resolver, and vendors. These lists put different columns in different places, and the same title is often written differently from list to list. When you are trying to check whether one system has the correct title list by comparing it against another list, it gets really hairy, really fast. Some of the usual culprits:
- Initial articles (The, An, A, etc.)
The Journal of Something Something vs. Journal of Something Something
- Ampersand and “And”
British Journal of Flotsam and Jetsam vs. British Journal of Flotsam & Jetsam
- Acronyms and abbreviations
BFG: Big Frilled Gar vs. Big Frilled Gar vs. B.F.G. vs. BFG
- Punctuation and spacing
BFG: Big Frilled Gar vs. BFG : Big Frilled Gar vs. BFG. Big Frilled Gar
- Proceedings and Conferences
Proceedings of the Association of Silly Walkers vs. Association of Silly Walkers: Proceedings
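For what it's worth, here is a minimal Python sketch of how several of these variations could be collapsed into a single comparison key. The function name `match_key`, the list of articles, and the rules themselves are my own assumptions, not an established standard, and the key is meant only for matching, never for display.

```python
import re
import unicodedata

# Assumed list of English initial articles; extend for other languages.
INITIAL_ARTICLES = ("the ", "an ", "a ")


def match_key(title):
    """Return a simplified key for comparing titles (hypothetical helper)."""
    # Unify Unicode forms and ignore case.
    key = unicodedata.normalize("NFKC", title).casefold()
    # Treat "&" and "and" as the same word.
    key = key.replace("&", " and ")
    # Drop the period after a single-letter word, so "b.f.g." becomes "bfg".
    key = re.sub(r"\b([a-z])\.", r"\1", key)
    # Turn any remaining punctuation into spaces, then collapse whitespace.
    key = re.sub(r"[^\w\s]", " ", key)
    key = " ".join(key.split())
    # Strip one leading article, if present.
    for article in INITIAL_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key


print(match_key("The Journal of Something Something"))
print(match_key("British Journal of Flotsam & Jetsam"))
print(match_key("BFG : Big Frilled Gar"))
print(match_key("B.F.G."))
```

This sketch does nothing for the harder cases: it will not match an acronym to its spelled-out form, and it will not recognize that "Proceedings of the Association of Silly Walkers" and "Association of Silly Walkers: Proceedings" are the same thing. Those probably need a lookup table or a human.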
Today I endeavored to find books, articles, websites, or anything that might shed light on theory and best practices. Unfortunately, I couldn’t even find the right terminology. “Normalization” in the computer world doesn’t mean what title normalization might mean to a librarian. After Google failed, I tried some targeted searching (Code4Lib, Stack Overflow, LISTA with full text) but didn’t find much. “Data cleaning” in the business world means cleaning up your customer list: names, addresses, and the like.
Well, here is the most helpful thing I’ve found today. It’s only practical tips for Excel, but something is better than nothing:
Top ten ways to clean your data (in Excel) from Microsoft
I have done a lot of data manipulation in Python, both with scripts that work through CSV or TSV files and with on-the-fly work in IDLE. It turns out you can do some of it in Excel. I still think string slicing in Python is more efficient, though. More on this subject later!
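As a rough illustration of the kind of script I mean, here is a sketch that compares two title lists whose title column sits in a different place and under a different header in each file. Everything in it is made up for the example: the sample data, the column names, and the helper names `crude_key` and `titles_by_key`. The key function is deliberately crude (lowercase plus whitespace cleanup), so swap in something smarter as needed.

```python
import csv
import io


def crude_key(title):
    """Lowercase and collapse whitespace; a stand-in for real normalization."""
    return " ".join(title.casefold().split())


def titles_by_key(csv_file, title_column, key=crude_key, delimiter=","):
    """Map each match key to the original title from the named column."""
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    return {key(row[title_column]): row[title_column] for row in reader}


# Made-up sample data; in practice these would be files opened with open().
ils_export = io.StringIO(
    "Title,ISSN\n"
    "Journal of Something Something,1234-5678\n"
    "British Journal of Flotsam and Jetsam,2345-6789\n"
)
vendor_list = io.StringIO(
    "id\tpublication_title\n"
    "1\tjournal of  something something\n"
    "2\tProceedings of the Association of Silly Walkers\n"
)

ils = titles_by_key(ils_export, "Title")
vendor = titles_by_key(vendor_list, "publication_title", delimiter="\t")

# Dictionary key views support set operations directly.
in_both = sorted(ils.keys() & vendor.keys())
only_ils = sorted(ils[k] for k in ils.keys() - vendor.keys())
only_vendor = sorted(vendor[k] for k in vendor.keys() - ils.keys())

print("In both:", in_both)
print("Only in ILS:", only_ils)
print("Only in vendor list:", only_vendor)
```

Reading by column header with `csv.DictReader` means it doesn't matter where the title column lives in each file, only what it is called.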