I don’t know about you, but I remember back in the days when I first learned how to create web sites. Oh god, the sites that I made back then! And more importantly, the content that used to be on them. Like a web site I created for a nightclub in Waikiki called Pink Cadillac, because I used to love going there. I put all the pictures and everything that happened at the nightclub on that site. Rubbing elbows, a.k.a. “collaborating”, with the promoters, DJs, and photographers to get the stuff online, keep up with the modern era, and reference it on “social media” (i.e. Myspace, before Facebook). This was back in 2004, before smartphones and the Millennial posture were a thing.
What would I give to be able to see that stuff again! Fortunately, there’s a more or less free service that archives public content: the awesome Internet Archive Wayback Machine. If only I had known about them back then and known how to archive a page with them.
So, here’s the thing about the Wayback Machine. In our current era of the technology world, websites are “dynamic”, which means their content is fluid and it’s super easy to update a page or even just send out a post like this one. WordPress, Joomla, and other Content Management Systems use this dynamic method to easily update and build sites automatically. It’s all related to the development of HTML5 and the “language” of the sites not being absolute or “static”. I’m talking about the URLs of displayed images and hyperlinked posts. The way the Wayback Machine works, it currently only archives “static” content. If you’re concerned or dismayed, don’t be. It’s not that hard to create static content that can be archived. The methods all involve caching a page or post.
Caching a dynamic page usually involves creating a static HTML page and serving that page to address the high-traffic problem of a server constantly having to render or parse a dynamic page. For WordPress, I use WP Super Cache to generate the HTML pages and the Jetpack plugin to generate the sitemap.xml file, which should pretty much have a reference to every page, image, and whatever else on your site. Also, you’ll want to look into a “broken link” plugin, because the Wayback Machine doesn’t archive links that return 404 errors, and it’d be very disappointing to hit an error on the Wayback Machine while browsing it. A quick check like the one sketched below works too.
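If you’d rather not install yet another plugin, here’s a minimal BASH sketch of the same idea, assuming your sitemap lists plain URLs inside its tags (same sitemap URL as the script later in this post). It just prints the HTTP status code next to each URL so you can spot the 404s before the Wayback Machine does.

#!/bin/bash
# Quick 404 check: pull every URL out of the sitemap and print its HTTP status code.
sitemap="https://www.boydhanaleiako.me/sitemap.xml"

for url in $( curl -s "$sitemap" | sed 's/>/>\n/g' | sed 's/</\n</g' | egrep -e "^http" ); do
    # -o /dev/null throws away the body; -w "%{http_code}" prints just the status.
    code=$( curl -s -o /dev/null -w "%{http_code}" "$url" )
    echo "$code $url"
done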
And now the good stuff… This is a BASH script that is routinely run via a cron job. It walks the sitemap and hits the Wayback Machine’s Save Page Now URL (web.archive.org/save/) for every page it finds. Does it work? You tell me. After this post I’ll just be using this script to archive this site. If things don’t go as planned, I’ll note it on this post.
#!/bin/bash
# Archive every page referenced by the sitemap using the Wayback Machine's
# "Save Page Now" URL (https://web.archive.org/save/<url>).
sitemap="https://www.boydhanaleiako.me/sitemap.xml"

function CRAWL {
    # Fetch the URL, put each XML tag on its own line, and keep anything
    # that starts with "http" (the bare URLs sitting between tags).
    for page in $( curl -s "$1" | sed 's/>/>\n/g' | sed 's/</\n</g' | egrep -e "^http" ); do
        # Ask the Wayback Machine to save this page, then back off a bit.
        curl -s "https://web.archive.org/save/$page" > /dev/null
        sleep 30
        # Recurse in case this entry is itself a sub-sitemap full of more URLs.
        CRAWL "$page"
    done
}

# The top-level sitemap.xml is usually an index of sub-sitemaps; crawl each one.
for map in $( curl -s "$sitemap" | sed 's/>/>\n/g' | sed 's/</\n</g' | egrep -e "^http" ); do
    curl -s "https://web.archive.org/save/$map" > /dev/null
    sleep 30
    CRAWL "$map"
done
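For the cron part, a line like this in the crontab does the trick. The path and schedule here are just placeholders; point it at wherever you saved the script and pick whatever schedule suits you.

# Run the archiving script every Sunday at 3:00 AM (edit with crontab -e).
# /home/me/bin/archive-site.sh is a hypothetical path.
0 3 * * 0 /home/me/bin/archive-site.sh >> /var/log/archive-site.log 2>&1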
One of the things you’ll likely ask is why I don’t have a plugin or whatnot doing the archiving. At this time, the Wayback Machine doesn’t have a public API to archive a page. They have APIs to read archived information, but from what I can tell there’s no real archiving API in place yet. I can’t really blame them. In fact, I’m sure they’ll have something to say about this if they ever hear about it, since they already archive a whole ton of data. You should check them out. They’re archiving a lot of awesome stuff.
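If you want to see what they’ve already got for a page, the read side is there today. Here’s a rough sketch against the Wayback Machine’s Availability API, using my own site as the example URL; it returns JSON, and "archived_snapshots" lists the closest saved copy if one exists.

# Ask the Wayback Machine's Availability API whether a page has been archived.
curl -s "https://archive.org/wayback/available?url=www.boydhanaleiako.me"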