I was lucky enough to attend Digital Humanities 2014 in Lausanne, Switzerland, this summer. One of the topics I was most interested in learning about was web archiving and scraping. I took a workshop led by Scott Reed from the Internet Archive on their Archive-It web-archiving service. At the workshop I ran into Ian Milligan, a professor in the Department of History at the University of Waterloo, who suggested I check out the web scraping tutorials on the Programming Historian website.
The most rewarding part of this workshop was being able to create my very own web archive of the websites for Rick Bébout and the CLGA. Getting full-text access to searchable archives of these sites will hopefully be of great use when doing research for the LGLC project.
For more on web archiving, check out Ian Milligan and Nick Ruest’s presentation, “The Great WARC Adventure: WARCs from creation to use.” From its description: “This presentation will cover a historical overview of web archiving, how best to both capture and preserve websites, and make them discoverable and usable using open source tools that can be easily replicated by other organizations, the interplay of the archivist and historian with respect to web archives, and finally ways to access web archives”.
The International Internet Preservation Consortium has also provided a small directory of open source web archiving tools.