c++ - Store a large number of HTML files


I am working on an academic project (a search engine). The main functions of this search engine are:

  1/ crawling 2/ storing 3/ indexing 4/ page ranking

All the sites that my search engine will crawl are available locally, which means that this is an intranet search engine.

After storing the files found by the crawler, these files need to be served quickly for caching purposes.

So I wonder what is the fastest way to store and retrieve these files?

The first idea was to use FTP or SSH, but these are connection-based protocols, and the time needed to connect, search for the file and fetch it is too long.

I have already read about the anatomy of Google and saw that they use a data repository. I would like to do the same, but I do not know how.

Notes: I am using Linux/Debian, and the search engine back-end is coded in C/C++.

Storing individual files is quite easy: wget -r http://www.example.com will store a local copy of all the (crawlable) content of example.com.

Be careful with generated pages, though, where the content differs depending on when (or from where) you access the page.
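If the pages are to be served from a local cache, the data repository mentioned in the question could look roughly like the sketch below. It assumes all pages are appended to a single data file and located through an in-memory map of offsets. The names `PageStore`, `put` and `get` are hypothetical, and this is not Google's actual format, only one simple layout that avoids a connection-based protocol.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

// Hypothetical append-only page repository: every page is appended to one
// data file, and an in-memory map remembers where each URL's content lives.
class PageStore {
public:
    // Assumption: the store starts empty, so any existing file is truncated.
    // A persistent version would save and reload the index instead.
    explicit PageStore(const std::string& path)
        : path_(path), out_(path, std::ios::binary | std::ios::trunc) {}

    // Append the page content and record its offset and length.
    bool put(const std::string& url, const std::string& html) {
        out_.write(html.data(), static_cast<std::streamsize>(html.size()));
        out_.flush();  // make the bytes visible to readers
        if (!out_) return false;
        index_[url] = Entry{size_, html.size()};
        size_ += html.size();
        return true;
    }

    // Read a page back with a single seek and a single read.
    bool get(const std::string& url, std::string& html) const {
        auto it = index_.find(url);
        if (it == index_.end()) return false;
        std::ifstream in(path_, std::ios::binary);
        if (!in) return false;
        in.seekg(static_cast<std::streamoff>(it->second.offset));
        html.resize(it->second.length);
        in.read(&html[0], static_cast<std::streamsize>(html.size()));
        return static_cast<std::size_t>(in.gcount()) == html.size();
    }

private:
    struct Entry {
        std::uint64_t offset;  // byte position of the page in the data file
        std::size_t length;    // number of bytes stored for the page
    };

    std::string path_;
    std::ofstream out_;
    std::uint64_t size_ = 0;  // current size of the data file in bytes
    std::unordered_map<std::string, Entry> index_;
};
```

The crawler would call `put(url, html)` for each fetched page, and the cache would call `get(url, html)` to serve it. Keeping everything in one file avoids creating thousands of small files, at the cost of never reclaiming space for pages that are stored again.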

One more thing to consider is that you may not really want to store all the pages yourself, but just forward to the site that actually contains the pages. That way, you only need to store a reference to which page contains which words, not the whole page. Since many pages will contain very repetitive content, you only need to store the unique words in your database, along with a list of the pages that contain each word. If you also filter out the words that occur on almost every page, like "if", "and", "this", "to", "do", etc., you can reduce the amount of data that you need to store. Count the occurrences of each word on each page, and then compare different pages to find the pages that are meaningless for searching.
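The word-to-pages mapping described above is usually called an inverted index. Here is a minimal sketch of it, assuming pages are identified by integer IDs, each page is added once, and words are runs of ASCII letters. The class name `InvertedIndex`, its methods and the `ratio` threshold are hypothetical choices for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical inverted index: each unique word maps to the set of page IDs
// that contain it, so the page text itself does not need to be stored.
class InvertedIndex {
public:
    // Split the text into lowercase alphabetic words and record the page
    // under each of them. Assumption: each page ID is added only once.
    void add_page(int page_id, const std::string& text) {
        ++page_count_;
        std::string word;
        for (unsigned char c : text) {
            if (std::isalpha(c)) {
                word += static_cast<char>(std::tolower(c));
            } else if (!word.empty()) {
                postings_[word].insert(page_id);
                word.clear();
            }
        }
        if (!word.empty()) postings_[word].insert(page_id);
    }

    // Drop words that occur on at least `ratio` of all pages ("and", "to", ...).
    void remove_common_words(double ratio) {
        for (auto it = postings_.begin(); it != postings_.end();) {
            if (it->second.size() >= ratio * page_count_) {
                it = postings_.erase(it);
            } else {
                ++it;
            }
        }
    }

    // Sorted IDs of the pages containing `word`; empty if unknown or filtered.
    std::vector<int> lookup(const std::string& word) const {
        auto it = postings_.find(word);
        if (it == postings_.end()) return {};
        return std::vector<int>(it->second.begin(), it->second.end());
    }

private:
    std::map<std::string, std::set<int>> postings_;
    std::size_t page_count_ = 0;
};
```

After every crawled page has gone through `add_page`, a call such as `remove_common_words(0.9)` discards the words found on 90% or more of the pages, and `lookup("index")` then returns the IDs of the pages to forward the user to. The 0.9 threshold is only an example value and would need tuning on real data.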
