Spider
Welcome to ontolux’s information page on the spider we use.
On behalf of our customer, the German Federal Office for Information Security (BSI), our spiders search the World Wide Web (WWW) in a targeted manner for content for further processing.
If you experience problems with one of our spiders, this page provides the most important information about them as well as solutions to the most common problems. Of course, you can also contact us personally at any time and describe your problem by e-mail at bsi@ontolux.de.
Information for experienced website operators
If you are a website operator who already has experience with robots.txt entries, please use the following user agent identifier to control our spiders on your website. Otherwise, we ask you to first read the detailed instructions in the FAQ below.
The user agent ID is: BSI-Robot (Federal Office for Information Security Germany; https://www.ontolux.de/spider; bsi@ontolux.de)
FAQ
Why does ontolux spider websites?
As a brand of Neofonie GmbH, ontolux implements text- and data-based solutions. As the founder of the first German-language search engine, we have decades of experience, especially in the field of search technologies. In the course of customer projects and scientific research projects, the German-language web is usually searched for data, which is then analyzed using scientific methods and enriched with the information obtained from it. Only the publicly accessible web is used as a basis for this, and all data protection guidelines are strictly adhered to. As is familiar from search engines, original content is referenced exclusively via links. Since the newly indexed content is usually made available to a much wider public, the benefit for individual website operators is considerable: their pages become discoverable via the newly created links.
What is a spider, and what does it do?
The general principles for spidering websites express the good will of spider operators to act responsibly on the Internet. All spiders used by ontolux likewise aim to respect the interests of website operators and to fetch content from websites with as little load on the web servers as possible.
Nevertheless, you as the operator of a website should be able to control access to your pages and decide what you want to make available to the public and what not.
The so-called "Robots Exclusion Standard" (http://de.wikipedia.org/wiki/Robots_Exclusion_Standard) was created for this purpose. It specifies that when a spider visits a web server, it first searches for, downloads, and evaluates a file called "robots.txt" in the root directory of the server. The rules contained in this file can be used to keep a spider away from certain areas of a website or to block it out completely. In addition, this file can point the spider to a sitemap file (http://de.wikipedia.org/wiki/Sitemaps). What you specifically need to do to restrict access for an ontolux spider is described in detail below.
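For illustration, a minimal robots.txt that blocks our spider from an entire site and announces a sitemap could look like this (www.example.com is a placeholder):
# Keep the ontolux spider away from the entire site
User-agent: BSI-Robot
Disallow: /
Sitemap: https://www.example.com/sitemap.xml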
In addition or as an alternative to this procedure, you can use the HTML meta tag "robots" (http://de.selfhtml.org/html/kopfdaten/meta.htm) to control, for individual pages, whether they are indexed and whether the links they contain are followed.
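For example, the following meta tag in the head section of an HTML page asks spiders neither to index that page nor to follow the links it contains:
<meta name="robots" content="noindex, nofollow">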
How can I restrict the access of the ontolux spider with robots.txt?
To view the current content of your robots.txt file, simply append "/robots.txt" to the URL of your website; if the file exists, your browser will display its content.
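For example, for a website reachable at www.example.com (a placeholder), you would open:
https://www.example.com/robots.txt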
To prohibit one of our spiders from accessing certain areas of your website, you can, for example, add the following lines to the robots.txt file in the root directory of your web server:
# The ontolux spider must not download anything from the folders /pictures and /personal
User-agent: BSI-Robot
Disallow: /pictures/
Disallow: /personal/
These restrictions allow the spider with the user agent identifier "BSI-Robot" to download all files found via links on your site, except those whose path contains the /pictures or /personal folders.
I am not familiar with server configurations. What can I do?
If the above entries in the robots.txt file do not help you, you can send an e-mail to our service address. We will then work with you to solve the problem, or configure our spider so that its accesses to your web server are acceptable to you.
What can I do if the spider queries my site too often?
If the ontolux spider queries your website too often, please let us know as well! We can adjust the spider and configure it accordingly. You can contact us via the e-mail address above. Thank you!
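Independent of this, many (though not all) spiders honor the non-standard Crawl-delay directive in robots.txt; whether a particular spider evaluates it varies, so contacting us remains the reliable route. An entry limiting the request rate might look like this:
# Non-standard directive: at most one request every 10 seconds (not honored by all spiders)
User-agent: BSI-Robot
Crawl-delay: 10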
Can I check the user agent identifier of the spider?
To find out which identifier a spider uses to access your pages, you can inspect the requests in your web server's access log. In a standard configuration, the identifier of the requesting user agent is recorded in this log.
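Such an entry, here in the widely used Apache "combined" log format with placeholder values, might look like this:
203.0.113.42 - - [12/May/2023:10:15:30 +0200] "GET /index.html HTTP/1.1" 200 5120 "-" "BSI-Robot (Federal Office for Information Security Germany; https://www.ontolux.de/spider; bsi@ontolux.de)"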
I have entered the user agent identifier in robots.txt, but my pages are still being visited. What can I do?
If, despite the entries described above, one of our spiders continues to visit pages that you have marked as blocked in robots.txt, please first check that your robots.txt file is correct. Once a typo or similar cause has been ruled out, check in your server's access log which identifier the requesting spider uses.
The user agent ID of our spider is "BSI-Robot (Federal Office for Information Security Germany; https://www.ontolux.de/spider; bsi@ontolux.de)". Based on such a log entry, you would enter "BSI-Robot" as the user agent identifier in your robots.txt file to address our spider.
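On a Unix-like system, you can, for example, search your access log for our spider as follows (the log file path is only an assumption and depends on your server configuration):
grep "BSI-Robot" /var/log/apache2/access.log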
If one of our spiders visits you in violation of the rules despite a correctly created robots.txt file, please inform us immediately, including the user agent ID you determined, so that we can check the spider in question right away.
Thank you!