The Deep Web
This blog post is all about the deep web. I will try to explain the underlying concepts behind the deep web and will address the following questions:
- What is the difference between the "Surface Web", "Deep Web", and "Dark Web"?
- What is the actual size of the "Deep Web"?
- Where do the "Surface Web", "Deep Web", and "Dark Web" reside?
- Why can't traditional search engines (like Google/Yahoo/Bing) index the "Deep Web"?
- How do traditional search engines (like Google) handle "Deep Web" content?
- How can one prevent his/her website from going into the "Deep Web"?
- How can we search the "Deep Web"?
- What is my proposed methodology to search the "Deep Web"?
- What is the business value of the "Deep Web" extraction?
I will divide the whole deep web topic into a series of sub-posts to answer the above-mentioned queries. So, in this post, I will answer the first few questions; the subsequent posts will cover the remaining ones.
Surface Web
The part of the internet that can be found via link-crawling techniques is called the Surface Web [3]; link crawling means that linked data can be reached via a hyperlink from the homepage of a domain. Traditional search engines (like Google/Yahoo/Bing) can find Surface Web data.
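To make link crawling concrete, here is a minimal crawler sketch (standard-library Python only). It starts from a seed page and follows `<a href>` links outward, which is essentially how a traditional search engine discovers Surface Web pages; the seed URL https://example.com is just a placeholder.

```python
# Minimal link-crawler sketch: discover pages by following hyperlinks
# outward from a seed, the way traditional search engines index the
# Surface Web. The seed URL is a placeholder, not a real target.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, frontier = set(), [seed]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)          # breadth-first traversal
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                   # unreachable pages are skipped
        parser = LinkExtractor()
        parser.feed(html)
        # Only pages reachable through <a href> links ever enter the
        # frontier; form-gated database content never does.
        frontier.extend(urljoin(url, link) for link in parser.links)
    return seen

if __name__ == "__main__":
    print(crawl("https://example.com"))
```

Anything this loop can never reach, such as content behind a search form, a login, or a robots.txt rule, is by definition outside the Surface Web.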
Deep Web
The part of the internet that is not accessible to traditional, link-crawling search engines (like Google) is called the Deep Web [3]. The only way a user can access this portion of the internet is by typing a directed query into a web search form, thereby retrieving content from a database that is not linked. In layman's terms, the only way to access the Deep Web is by conducting a search within a particular website.
Alternative terms for the Deep Web are: Deep Net, Invisible Web, or Hidden Web.
Dark Web
The Dark Web [3] refers to any web page that has been concealed to hide in plain sight, or that resides within a separate but public layer of the standard internet.
The internet is built around web pages that reference other web pages; if you have a destination web page that has no inbound links, you have concealed that page, and it cannot be found by users or search engines. One example of this would be a blog post that has not been published yet. The blog post may exist on the public internet, but unless you know the exact URL, it will never be found.
Other examples of Dark Web [3] content and techniques include:
- Search boxes that will reveal a web page or answer if a special keyword is searched.
- Sub-domain names that are never linked to; for example, “internal.raoumer.com”
- Relying on special HTTP headers to show a different version of a web page, e.g., the desktop vs. mobile version of a website (see the sketch after this list).
- Images that are published but never actually referenced, for example “/images/logo.png”
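To illustrate the special-HTTP-headers technique from the list above, here is a small, self-contained sketch (standard-library Python, a toy example rather than any real site's setup) of a server that returns different content for the same URL depending on the User-Agent header:

```python
# Toy server: the same URL returns different content depending on the
# User-Agent header, so a crawler and a phone see different pages.
# Try: curl http://localhost:8000/  vs  curl -A Mobile http://localhost:8000/
from http.server import BaseHTTPRequestHandler, HTTPServer

class HeaderSensitiveHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        # Crude sniffing -- real sites use far more elaborate rules.
        body = b"mobile version" if "Mobile" in agent else b"desktop version"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), HeaderSensitiveHandler).serve_forever()
```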
Virtual Private Networks (VPNs) [3] are another aspect of the Dark Web that exists within the public internet, and they often require additional software to access. TOR (The Onion Router) is a great example. Hidden within the public web is an entire network of different content which can only be accessed by using the TOR network. Users must use an anonymizer to access TOR network/Dark Web websites. The Silk Road, an online marketplace and infamous drug bazaar on the Dark Web, was inaccessible using a normal search engine or web browser.
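As a rough sketch of what that additional software means in practice, the snippet below routes a request through a local Tor daemon's SOCKS proxy. It assumes Tor is running on its default port 9050 and that the requests library is installed with SOCKS support (pip install requests[socks]); the .onion address is a made-up placeholder, not a real service.

```python
# Reaching a Tor hidden service through the local Tor daemon's SOCKS
# proxy. Assumes Tor on its default port 9050 and requests[socks].
import requests

TOR_PROXY = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h = resolve DNS inside Tor,
    "https": "socks5h://127.0.0.1:9050",  # required for .onion addresses
}

# Placeholder address -- substitute a real v3 .onion hostname.
resp = requests.get("http://placeholderonionaddress.onion",
                    proxies=TOR_PROXY, timeout=30)
print(resp.status_code)
```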
According to Forbes.com:
The Web is like an iceberg divided into three segments, each with its own cluster of hangouts for cyber criminals and
their digital breadcrumbs. The tip of the iceberg is the “Clear Web” (also called the Surface Web), indexed by Google and other
search engines. The very large body of the iceberg, submerged under the virtual water, is the “Deep Web”—anything on the Web
that’s not accessible to the search engines (e.g., your bank account). Within the Deep Web lies the “Dark Web,” a region of
the iceberg that is difficult to access and can be reached only via specialized networks.
From here onward, my discussion will focus entirely on the Deep Web. I am not covering the Dark Web, which involves illegal use of the internet. So, let's go into the depths of the Deep Web.
The following figure briefly explains Surface Web, Deep Web, and Dark Web content. We focus only on Deep Web content, which is legal information for web searchers.
Here is another graphical representation of the size and relevancy of the Deep Web:
Now, let's look at real use cases to understand the Deep Web more concretely.
Let's take the hdmoviespoint.com website: we can download a particular movie by clicking the download button, which leads to another web page containing the download link for that specific movie. In this case, the download-link page is only accessible by navigating the hdmoviespoint website; it does not show up in Google search results. So, this content is Deep Web content.
Let's take another use case of the Deep Web. Think of searching for government grants; most researchers start by searching "government grants" in Google and find a few specific listings for government grant sites that contain databases. Google will direct researchers to the website www.grants.gov, but not to specific grants within the website's database. Researchers can search thousands of grants at www.grants.gov by querying the database via the website's search box. In this example, a Surface Web search engine (Google) led users to a Deep Web website (www.grants.gov), where a directed query to the search box brings back Deep Web content not found via Google search.
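The snippet below sketches such a directed query programmatically: instead of following links, it submits a keyword to a site's own search form and reads the dynamically generated results page. The site, form path (/search), and parameter name (q) are hypothetical placeholders; every real site, grants.gov included, defines its own form fields.

```python
# Sketch of a "directed query": submit a keyword to a site's search
# form and read the results page that is generated from its database.
# Site, path, and parameter name below are hypothetical placeholders.
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

def directed_query(site, keyword):
    # Build the same GET request the site's search box would send.
    url = f"{site}/search?{urlencode({'q': keyword})}"
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8", "ignore")

if __name__ == "__main__":
    try:
        # A crawler never issues this request on its own, so the results
        # page stays invisible to Google -- that is Deep Web content.
        print(directed_query("https://example.org", "research grants")[:200])
    except HTTPError as err:
        print("placeholder site has no /search endpoint:", err.code)
```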
On the quantified size [2] and relevancy of the Deep Web, key findings from recent years include:
- Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
- The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web.
- The deep Web contains nearly 550 billion individual documents, compared to the one billion documents of the surface Web.
- More than 200,000 deep Web sites presently exist.
- Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
- On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
- The deep Web is the largest growing category of new information on the Internet.
- Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
- Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
- Deep Web content is highly relevant to every information need, market, and domain.
- More than half of the deep Web content resides in topic-specific databases.
- A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.
Methods [1] that prevent web pages from being indexed by traditional search engines may be categorized as one or more of the following:
- Contextual Web: Pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
- Dynamic content: Dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
- Limited access content: Sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard or CAPTCHAs, or a no-store directive which prohibits search engines from browsing them and creating cached copies); a sketch of how crawlers honor robots.txt follows this list.
- Non-HTML/text content: Textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
- Private Web: Sites that require registration and login (password-protected resources).
- Scripted content: Pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from web servers via Flash or Ajax solutions.
- Software: Certain content is intentionally hidden from the regular internet, accessible only with special software such as Tor, I2P, or other darknet software. For example, Tor allows users to access websites using the .onion host suffix anonymously, hiding their IP address.
- Unlinked content: Pages which are not linked to by other pages, which may prevent web-crawling programs from accessing the content. Such pages are referred to as pages without backlinks (also known as inlinks). Also, search engines do not always detect all backlinks from searched web pages.
- Web archives: Web archival services such as the Wayback Machine enable users to see archived versions of web pages across time, including websites which have become inaccessible and are not indexed by search engines such as Google.
- Searchable databases: Searchable databases are the most valuable Deep Web resources.
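As promised in the list above, here is a short sketch of the Robots Exclusion Standard from the crawler's side: a compliant search-engine bot parses a site's robots.txt and skips any URL the file disallows, which effectively pushes that content into the Deep Web. The domain and path below are placeholders.

```python
# How a polite crawler honors robots.txt: check each URL against the
# site's rules before fetching it. Domain and path are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A "Disallow: /private/" rule in robots.txt would make this False,
# and a compliant bot would then never index the page.
print(rp.can_fetch("GoogleBot", "https://example.com/private/report.html"))
```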
Here, we will focus only on "dynamic content" and "searchable databases" for extracting Deep Web content.
So, I hope this post helps you understand the underlying concepts of the Deep Web. In the next post, we will explore the Deep Web further, including its extraction methods and business value. So, please stay tuned for future posts.