c++ - store a high amount of HTML files
I am working on an academic project (a search engine). Its main work is divided into four steps:

1. crawling
2. storing
3. indexing
4. page ranking

My search engine only crawls sites on the local network, which means it is an intranet search engine. After the crawler stores the files it finds, those files need to be served quickly for caching purposes. So I wonder: what is the fastest way to store and retrieve these files? My first idea was to use FTP or SSH, but these are connection-based protocols, and the time spent connecting, locating the file and transferring it makes them slow. I have already read about the anatomy of Google and saw that it uses a data repository; I would like to do the same, but I do not know how.

Notes: I am using Linux/Debian, and the search-engine back-end is coded in C/C++.
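As background for the repository idea mentioned above: one common way to build such a store is to append every fetched page to a single pack file and keep an index from URL to file offset, so a lookup is one seek and one read instead of a new connection. The sketch below is only an illustration of that layout in C++; the Repository class, the on-disk format and the pages.pack file name are made up, and it skips compression and crash safety.

// repository.cpp -- minimal sketch of an append-only page repository.
// Assumptions: pages fit in memory one at a time; the URL->offset index is
// kept in memory only (a real version would persist it as well).
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

class Repository {
public:
    explicit Repository(const std::string& path)
        : data_(path, std::ios::in | std::ios::out | std::ios::app | std::ios::binary) {}

    // Append the page body to the pack file and remember where it starts.
    void store(const std::string& url, const std::string& html) {
        data_.seekp(0, std::ios::end);
        std::uint64_t offset = static_cast<std::uint64_t>(data_.tellp());
        std::uint64_t size = html.size();
        data_.write(reinterpret_cast<const char*>(&size), sizeof(size));
        data_.write(html.data(), static_cast<std::streamsize>(size));
        data_.flush();
        index_[url] = offset;  // URL -> offset in the pack file
    }

    // Retrieval is one seek plus two reads; no per-request connection cost.
    bool load(const std::string& url, std::string& html) {
        auto it = index_.find(url);
        if (it == index_.end()) return false;
        data_.seekg(static_cast<std::streamoff>(it->second));
        std::uint64_t size = 0;
        data_.read(reinterpret_cast<char*>(&size), sizeof(size));
        html.resize(size);
        data_.read(&html[0], static_cast<std::streamsize>(size));
        return static_cast<bool>(data_);
    }

private:
    std::fstream data_;                                     // the pack file
    std::unordered_map<std::string, std::uint64_t> index_;  // URL -> offset
};

int main() {
    Repository repo("pages.pack");
    repo.store("http://intranet.local/index.html", "<html><body>hello</body></html>");
    std::string html;
    if (repo.load("http://intranet.local/index.html", html))
        std::cout << html << "\n";
}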
Storing the files themselves is easy:

wget -r http://www.example.com

will save the files for you and is able to mirror an entire site. Be careful with pages that are generated by code on example.com, where the content you get back differs depending on how (or from where) it is accessed.

One more thing to consider is that you may not really want to store all the pages yourself, but instead point to the site that actually contains them. That way you only need to cache a reference to which words each page contains, not the whole page. Since many pages will contain very repetitive content, you only store the unique words in your database together with a list of the pages that contain each word. If you also filter out words that occur on almost every page, such as "if", "and", "this", "to", "do", and so on, you can reduce the amount of data you need to store. Count the number of occurrences of each word on each page, and then compare those counts across different pages to find pages that are meaningless to search.
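To make the word-reference idea concrete, here is a rough sketch of such an index: it maps each word to the pages that contain it, with a per-page occurrence count, and skips a small stop-word list. The InvertedIndex class, the hand-picked kStopWords set, the naive tokeniser (it assumes plain text, not raw HTML) and the example URLs are all illustrative choices, not part of the original answer.

// inverted_index.cpp -- sketch of a word -> {page, count} index with stop-word filtering.
#include <cctype>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

// Words that appear on almost every page carry no search value.
static const std::set<std::string> kStopWords = {
    "if", "and", "this", "to", "do", "the", "a", "of", "in", "is"};

class InvertedIndex {
public:
    // Tokenise the page text and record how often each word occurs on it.
    void addPage(const std::string& url, const std::string& text) {
        std::string word;
        for (char c : text) {
            if (std::isalnum(static_cast<unsigned char>(c))) {
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            } else if (!word.empty()) {
                if (!kStopWords.count(word)) postings_[word][url]++;
                word.clear();
            }
        }
        if (!word.empty() && !kStopWords.count(word)) postings_[word][url]++;
    }

    // Pages containing `word` (lower-cased), with occurrence counts.
    const std::map<std::string, int>* lookup(const std::string& word) const {
        auto it = postings_.find(word);
        return it == postings_.end() ? nullptr : &it->second;
    }

private:
    // word -> (page URL -> number of occurrences on that page)
    std::unordered_map<std::string, std::map<std::string, int>> postings_;
};

int main() {
    InvertedIndex index;
    index.addPage("http://intranet.local/a.html", "Intranet search engine and crawler notes");
    index.addPage("http://intranet.local/b.html", "Crawler schedule for the intranet");
    if (const auto* pages = index.lookup("crawler")) {
        for (const auto& [url, count] : *pages) std::cout << url << " x" << count << "\n";
    }
}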