Everybody knows the Googlebot, but how many of you know how it works?
Some years ago I worked on a crawler called OpenCrawler and I learned a lot from it.
To start you need to learn these standards:
- Hypertext Transfer Protocol
- Robots.txt exclusion standard
- RFC 3986
- Document Object Model
- User agent
I also suggest learning and implementing these:
You need a database too. MySQL or PostgreSQL should be perfect for this work; don’t use the filesystem or a small DBMS.
When I wrote that crawler I was still improving my PHP, so some things are not implemented: Sitemaps in non-XML formats and Sitemap index files are not supported, the URL normalizer is bad, there is nothing about other schemes, etc…
Before you start indexing the web (or a domain) you have to decide on one or more entry points. If you want to index the entire web, en.wikipedia.org/wiki/Main_Page is a good start because it contains many public links and some good external links.
The database schema is not simple! You must think about what you need: fast indexing, archive only, a particular ranking, etc… For this example I’ll show the minimal database schema for archive only (but if you want you can contact me and I’ll help you design the right solution).
The tool I used for this image is not complete so I need to explain some things:
- the table links is ordered by the visited attribute, oldest first (asc), and this attribute can be null
- url is a text because the length of a URL is theoretically unlimited, but you need to choose a limit as everybody does: http://stackoverflow.com/a/417184/871913. It should be unique
- id is an incremental value; I chose 11 as the length for this primary key, but this value depends on the number of pages you’ll need to index
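Since the image may not tell the whole story, here is a minimal sketch of the two tables in code. It uses Python with the built-in sqlite3 module only so that the example runs on its own (for real crawling the advice above about MySQL or PostgreSQL still stands). Only id, url and visited come from the description; the columns of archives and the index name are assumptions made for this sketch.

```python
import sqlite3

# Only links.id, links.url and links.visited come from the description above.
# The archives columns (link_id, fetched, content) are hypothetical.
SCHEMA = """
CREATE TABLE links (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    url     TEXT NOT NULL UNIQUE,      -- unique, so duplicates are rejected
    visited TIMESTAMP NULL             -- NULL means "never visited"
);
CREATE INDEX links_visited ON links (visited);
CREATE TABLE archives (
    link_id INTEGER PRIMARY KEY REFERENCES links (id),
    fetched TIMESTAMP NOT NULL,
    content BLOB NOT NULL              -- the large attribute kept out of links
);
"""


def create_schema(conn):
    """Create the two tables on an open connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
]
print(tables)
```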
Why did I create these tables?
Links is used as a fast control system to select the next URL your crawler will visit. In this table the visited attribute is updated to the current timestamp every time a crawler instance asks the database for the next URL to visit.
New links are added with the visited attribute set to null. This way, when you ask the database for the next URL, it returns either a new, never visited URL or the oldest visited one.
Archives is used to store web pages. It’s separated from the links table because it contains a very large attribute that slows down the queries.
This is safe against multiple crawler instances (when instance A asks for the next URL, it immediately updates the visited attribute of the obtained row; this way the database reorders the rows and instance B obtains the right next URL while A is processing) and against interruptions (everything the crawler needs to restart is stored in the database).
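The "ask for the next URL and immediately update visited" step could look roughly like this. It is a sketch on SQLite, not the code of OpenCrawler: the function name claim_next_url and the now parameter are made up for the example, and on MySQL or PostgreSQL you would probably rely on row locking (for example SELECT … FOR UPDATE) instead of BEGIN IMMEDIATE.

```python
import sqlite3
import time


def claim_next_url(conn, now=None):
    """Return the next URL to visit and mark it as visited in one transaction."""
    now = time.time() if now is None else now
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            # Never visited rows (NULL) first, then the oldest visited one.
            "SELECT id, url FROM links "
            "ORDER BY visited IS NOT NULL, visited ASC, id ASC LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE links SET visited = ? WHERE id = ?", (now, row[0]))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[1]


# isolation_level=None lets us manage transactions by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT UNIQUE, visited REAL NULL)"
)
conn.executemany(
    "INSERT INTO links (url, visited) VALUES (?, ?)",
    [
        ("http://example.com/a", 100.0),
        ("http://example.com/b", None),
        ("http://example.com/c", 50.0),
    ],
)
order = [claim_next_url(conn, now=1000.0 + i) for i in range(4)]
print(order)
```

The never visited URL comes out first, then the rows from the oldest to the newest, and after a full round the queue starts again.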
After this step you must retrieve the robots.txt file. It’s always in the domain root and contains rules for search engines and bots.
This de facto standard is important because bots can’t know the difference between these URLs:
- google.com/search?foo=bar: this URL is a fake, but why is it a fake? Because we understand immediately that it’s not a URL generated by google.com, but our bot doesn’t have this information (and can’t have it), so it would have to visit the page and understand it from the HTTP code (probably 404), and you would only lose time. The web grows minute after minute, you can’t lose time!
- google.com/search?q=bar: this URL is generated by google.com but through human interaction. Anyone can generate infinite new URLs by typing whatever they want; if the crawler falls into these URLs and they contain new URLs with similar parameters, the crawler will loop through useless URLs.
- google.com: this domain can be linked from other pages (yahoo.net/web-page, ask.com/test, etc…) but how can we know that it wants to be indexed?
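To give an idea of how a robots.txt rule saves you from URLs like the ones above, here is a small sketch with urllib.robotparser from the Python standard library. The robots.txt content and the example.com URLs are invented for the example, and the rules are parsed from a string so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that keeps every bot away from the search pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def build_parser(robots_txt):
    """Parse robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed_home = parser.can_fetch("OpenCrawler", "http://example.com/")
allowed_search = parser.can_fetch("OpenCrawler", "http://example.com/search?q=bar")
print(allowed_home, allowed_search)
```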
When you parse the robots.txt rules pay attention: my suggestion is to choose a name for your bot (in my project it was OpenCrawler [major].[minor].[bugfixes] (+[link])), but you can also choose a random browser user agent, and you’ll see the same content that common browsers see.
Remember: the User-Agent Te.st*-01 corresponds to the regular expression Te\.st.*-01 (the dot is a literal dot, only the asterisk is a wildcard).
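The conversion from a User-Agent pattern to a regular expression can be sketched like this. The function name agent_pattern_to_regex is hypothetical, and matching case-insensitively is my assumption, not something stated above.

```python
import re


def agent_pattern_to_regex(pattern):
    """Turn a pattern such as 'Te.st*-01' into a compiled regular expression.

    Every character is taken literally except '*', which matches any
    sequence of characters.
    """
    parts = [re.escape(part) for part in pattern.split("*")]
    # Assumption: user agent names are compared case-insensitively.
    return re.compile(".*".join(parts), re.IGNORECASE)


rx = agent_pattern_to_regex("Te.st*-01")
print(bool(rx.fullmatch("Te.stBot-01")))  # the '*' absorbs "Bot"
print(bool(rx.fullmatch("Texst-01")))     # the '.' must be a real dot
```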
If you are visiting a URL of a domain that was never visited before, I recommend visiting the sitemap. It’s usually an XML file in the root of a domain, like example.com/sitemap.xml, and it directly contains final URLs, but it can also contain an index of sitemaps (yes, this process can be very long). Visit it and add all the new URLs you find to the database.
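Here is a rough sketch of how the two kinds of XML sitemap can be told apart, using xml.etree.ElementTree from the Python standard library. The function name parse_sitemap and the sample documents are made up; the namespace is the one used by the sitemaps.org format.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, locations) where kind is 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text)
    kind = root.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
    if kind not in ("urlset", "sitemapindex"):
        raise ValueError("not a sitemap: " + kind)
    child = "url" if kind == "urlset" else "sitemap"
    locations = [
        loc.text.strip()
        for loc in root.findall("sm:%s/sm:loc" % child, NS)
        if loc.text
    ]
    return kind, locations


URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-1.xml</loc></sitemap>
</sitemapindex>"""

urlset_result = parse_sitemap(URLSET)
index_result = parse_sitemap(INDEX)
print(urlset_result)
print(index_result)
```

When the result is a sitemapindex, every location is another sitemap to download and parse in the same way; when it is a urlset, the locations are the final URLs for the links table.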
Attention! Before every insert into the database, check that the URL is unique, because you lose a very important principle if a URL (and its content) is duplicated: a 10KB page is not a problem but a 10GB one is, and you can’t be sure that’s the only duplicate.
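One possible way to do this check is to let the database enforce it with a unique constraint and ignore the insert when the URL is already there. This is only a sketch on SQLite (INSERT OR IGNORE is SQLite syntax; MySQL has INSERT IGNORE and PostgreSQL has ON CONFLICT DO NOTHING), and add_link is a name I made up.

```python
import sqlite3


def add_link(conn, url):
    """Insert a never visited URL; return True only if it was really new."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO links (url, visited) VALUES (?, NULL)", (url,)
    )
    return cursor.rowcount == 1  # 0 rows changed means the URL already existed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE, visited REAL NULL)"
)
first = add_link(conn, "http://example.com/page")
second = add_link(conn, "http://example.com/page")
total = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
print(first, second, total)
```

Keep in mind that the constraint only catches identical strings, so the URL should be normalized before the insert.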
For a better web I’d suggest following the extended robots.txt standard, but nobody ever listens to me or this page so do what you want.
Now, if the URL you want to visit is new or wasn’t visited too recently, and robots.txt allows you to visit it, you can finally visit the page and extract the content.
Sometimes you’ll be redirected to another page; that’s not bad practice, but keep count of the redirects, because some spider traps are based on infinite redirects.
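Counting redirects can be sketched like this. To keep the example runnable without a network, the HTTP request is replaced by a fetch function passed as a parameter that returns the status code and the Location header; this function, the limit of 5 and the TooManyRedirects exception are all assumptions of the sketch.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class TooManyRedirects(Exception):
    pass


def follow_redirects(url, fetch, max_redirects=5):
    """Follow redirects by hand; fetch(url) must return (status, location)."""
    seen = {url}
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, status
        url = urljoin(url, location)  # Location may be a relative reference
        if url in seen:
            raise TooManyRedirects("redirect loop at " + url)
        seen.add(url)
    raise TooManyRedirects("more than %d redirects" % max_redirects)


# A fake web used instead of real HTTP requests.
PAGES = {
    "http://example.com/a": (301, "/b"),
    "http://example.com/b": (302, "http://example.com/c"),
    "http://example.com/c": (200, None),
    "http://example.com/loop": (302, "/loop"),
}


def fake_fetch(url):
    return PAGES[url]


final = follow_redirects("http://example.com/a", fake_fetch)
print(final)
```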
Check the HTTP code: if it’s 400 or more you shouldn’t store the page, and you should continue with another URL; the reason is the same as for the main usage of robots.txt.
Life with HTTP codes is not so simple, I hope you like them! :)
Now that you have the content, add or update it in the archives table (if you can) and pass it to the DOM parser. There are some reasons not to store the content:
- a 400+ HTTP code
- there’s a meta tag with the attribute name="robots" whose content attribute (written as a CSV) contains the word noindex
- there’s the HTTP header X-Robots-Tag: noindex
For any of these reasons you shouldn’t store the content, but you can still process it.
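The three checks above can be put together in one function. This is a simplified sketch: should_store and RobotsMetaParser are hypothetical names, headers is assumed to be a plain dictionary, and an X-Robots-Tag value addressed to a specific bot is not handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def should_store(status, headers, html):
    """Return False when one of the three reasons not to store applies."""
    if status >= 400:
        return False
    lowered = {name.lower(): value for name, value in headers.items()}
    header = lowered.get("x-robots-tag", "")
    if "noindex" in [part.strip().lower() for part in header.split(",")]:
        return False
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


print(should_store(200, {}, "<html><head><title>ok</title></head></html>"))
print(should_store(200, {}, '<meta name="robots" content="NOINDEX, follow">'))
```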
Once the DOM parser has the content, you are in the elaboration process; here you can extract all the information your special search engine needs: title, meta tags, body content, microformats, etc…
For this example we don’t look for anything other than anchors, which are necessary to continue the crawling: we need to extract the <base> tag (if defined) and all the <a> tags.
For this process you’ll need a URL resolver based on RFC 3986. Pass the visited URL to it and resolve the href attribute of the base tag against it; then, for each anchor (<a>) found without the rel="nofollow" attribute, resolve the anchor’s href attribute against the obtained base. Every new URL obtained should be added to the links table with the visited attribute set to null (it’s a new link); if it’s already in the links table, don’t update the record.
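The extraction of the anchors can be sketched with html.parser and urllib.parse.urljoin from the Python standard library, used here as the URL resolver. The names LinkParser and extract_links are made up; dropping the fragment and keeping only http and https URLs are my own choices for the example, not something required by the process described above.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the first <base href> and every followable <a href>."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "a":
            rel = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel:
                self.hrefs.append(href)


def extract_links(page_url, html):
    """Return the absolute URLs linked from the page, without duplicates."""
    parser = LinkParser()
    parser.feed(html)
    # Visited URL first, then the base tag, then each anchor.
    base = urljoin(page_url, parser.base_href or "")
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base, href))
        # Assumption: only http(s) links are interesting for this crawler.
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


HTML = (
    '<base href="/docs/">'
    '<a href="intro.html">intro</a>'
    '<a href="http://other.example/x" rel="nofollow">skip me</a>'
    '<a href="../about#team">about</a>'
    '<a href="mailto:me@example.com">mail</a>'
)
links = extract_links("http://example.com/index.html", HTML)
print(links)
```

Every URL returned here is a candidate for the links table, to be inserted with visited set to null only if it’s not already there.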
Your new crawler now starts again by asking the database for the next URL it should visit.
This process can be split into several others: a bot that looks for new domains in the database and extracts sitemaps, another that looks for domains and fills and stores the robots_txt table in the database, a bot that extracts new content and a bot that processes it, etc…
The last link I reported above is a not-so-used standard proposed by Google; imho every web 2.0 website should implement it because search engines will never become humans.