This is a continuation of my series on the sessions I attended at ALA 2013. Erin McKean is the former Oxford University Press editor-in-chief for dictionaries and currently the founder of Wordnik.com, a site that collects multiple definitions for words. I was familiar with her from her TED talk and was eager to hear her speak.
McKean calls herself a data packrat. She wants to collect as much data as possible in the areas in which she is interested. She sees this as primarily a problem of organization. She illustrated the problem with images of old hard drives, Moleskine notebooks, and boxes of sewing patterns in her home. Now McKean also has data she can save digitally. She keeps an Evernote notebook and a Pinterest account, using the latter not for its intended purpose but to save texts that are of interest. She recommends a browser extension called “Findings,” which allows one to clip a sentence and associate it with a URL, to serve as a reminder of why the URL was interesting in the first place.
McKean explained the relationship between data packratism and lexicography with reference to James Murray, the editor of the first edition of the Oxford English Dictionary. He gathered clippings of quotations showing words in context to use in the dictionary. McKean compared dictionaries to a slurry; lexicographers grind sentences into a paste. But it would be better to avoid the lunchmeat model and show the words in context. The definitions people give of words as they use them are free-range definitions and correspond more closely to how words are used in everyday life. McKean’s goal is to gather all these sentences and associate them with words. She traces this idea back to Wittgenstein and Richard Chenevix Trench, who was involved in the origins of the OED.
Today, lexicographers’ role is to find the words and the quotations, more than to make the slurry. They don’t want to be like Humpty Dumpty in Through the Looking-Glass, arbitrarily dictating the meanings of words. McKean prefers to think of definitions as sculpture, not meatloaf: she listens to the people who are using the words in order to allow the definitions to reveal themselves.
Although more data is available now than in the nineteenth century, it is still difficult to get at the information that is needed. Sometimes this is a needle-in-the-haystack problem. Other times, the data is too messy to be usable. For instance, in some cases OCR does not work on these texts, and they are too numerous or extensive to transcribe by hand. Sometimes, when documents are opened, the text is not accessible. Data mining can be messy and destructive. Many amazing data sets are trapped in obsolete formats, copyright traps, or simple obscurity. And sometimes, things just aren’t knowable; she referred to the claim that one writer had read everything there is to read, but there are several versions of this story with different writers, and it isn’t possible to know to which writer, if any, this achievement should be credited. McKean started Wordnik with the intention of adding as much data as quickly as possible. She has a large collection of words that aren’t in dictionaries just because nobody has had the time to study them yet.
McKean addressed some of the problems with data hoarding and the reasons to hoard less. She has made great use of the Internet Archive; she doesn’t have to save all the websites because Brewster Kahle will save them for her. Libraries also play a role in preserving texts and making them available. She says, when the processing is ready, the data will appear (a play on the Zen saying, “When the student is ready, the master will appear”). Information overload is an old problem; McKean cited a newspaper article about excessive information availability from 1915. Instead of having data stores that people must seek out, McKean asks, what if data could be everywhere? She referred to some websites that allow data to be layered on top of what’s already known, like the site Findery, which allows people to attach notes or photographs to places. Much of this is user-generated. She imagined layering all the speeches that had been given in the conference room in which she spoke, admitting that this could be intimidating to speakers. Computers and storage are cheap now, but connecting people to the right information at the right time still is not easy.
Wordnik has an open API. When the site first launched, its founders were worried that people would copy it, but that hasn’t really happened. What has happened is that many interesting projects have been built on its data, including Amazon Random Shopper (@tinysubversions), which uses it to buy random items on Amazon. Other projects have used it to build GRE study apps or just to cheat at Words with Friends. McKean and her team are now enabling data hoarding rather than engaging in it themselves and have learned that there are always more questions to answer.
McKean stressed the need to promote data sets and said that if any librarians email or tweet her about their data sets, she will blog and tweet about them.
She concluded by discussing why we want these data sets. For her, there is no overwhelming “why” to it; these data sets are beautiful because they are there. Every data set is its own Mount Everest.
The question-and-answer session was also fascinating, but this is getting a bit long, so I won’t summarize it. Ask me if you want more. In any case, this was one of my favorite sessions of the conference, even though I don’t have a practical use for it right now. McKean is an incredibly engaging speaker and her project is fascinating.