hadoop - Index the raw HTML content using Solr/Lucene
I have some HTML that I scraped from the same website at different times. The raw data looks like this:
timestamp, htmlcontent (500KB) ... I wrote a parser to extract certain fields, and I'm trying to build a search engine based on the fields I parsed. The raw data is not plain text; it is the complete HTML content.
Now my data looks like this:
timestamp, htmlcontent, parsedfield1, parsedfield2
I want the user to be able to search on timestamp, parsedfield1 or parsedfield2, and have my search engine return the raw HTML that matches the query and render it in the browser ... so it feels like a search engine time machine :)
In this case, I'm wondering how I should do the indexing. Which fields should be stored and which should not? I'm following the book "Lucene in Action" and wondering whether anyone can help me deal with this problem.
Based on my understanding, schema.xml has a couple of attributes per field ... indexed or not? stored or not? ... I think the rule is: "whatever you want to include in the query results has to be stored." In that case, I would have to store the column that contains the raw HTML ...
That column is very large, though: a single record is usually hundreds of KB ... so with only hundreds of rows you can easily end up with a dataset of around 1 GB ... I suspect that will not work well in Solr, and when I tried to index those columns with Lucene I ran into problems with this volume of data ...
Here's another idea: maybe I index parsedfield1, parsedfield2 and a pointer ... where the pointer column is the full path to the raw HTML file. Of course, in that case I have to store every HTML page as a separate file, locally or on HDFS ... then when the user searches on parsedfield1, the search returns the full path and I return those files ...
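A minimal sketch of this pointer idea, assuming the snapshots live on the local filesystem as a stand-in for HDFS. The function names, the hash-based file naming and the record layout are hypothetical and only illustrate the shape of the approach: the large HTML goes to its own file, and only a small record with the path goes to the index.

```python
import hashlib
from pathlib import Path


def store_snapshot(root: Path, timestamp: str, html: str,
                   parsedfield1: str, parsedfield2: str) -> dict:
    """Write one raw HTML snapshot to its own file and return the small
    record that would be sent to the index instead of the HTML."""
    # Hypothetical naming scheme: hash of timestamp + content, so two
    # snapshots of the same page taken at different times get different files.
    digest = hashlib.sha1((timestamp + html).encode("utf-8")).hexdigest()
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{digest}.html"
    path.write_text(html, encoding="utf-8")
    return {
        "id": digest,
        "timestamp": timestamp,
        "parsedfield1": parsedfield1,
        "parsedfield2": parsedfield2,
        "htmlcontent": str(path),  # pointer to the file, not the HTML itself
    }


def load_snapshot(record: dict) -> str:
    """Follow the pointer in a search hit back to the raw HTML."""
    return Path(record["htmlcontent"]).read_text(encoding="utf-8")
```

With this layout the index only ever holds a few short strings per document, and the size of the HTML affects file storage rather than the search engine.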
I hope I have described the problem clearly, and I'm hoping someone can give me some directional guidance ...
Much appreciated!
Some guidelines:
1. You can store your data in XML, CSV or JSON format. I'll use XML.
Example: your data in XML format
    <add>
      <doc>
        <field name="id">01</field>
        <field name="timestamp">AWL</field>
        <field name="parsedfield1">Your data 1</field>
        <field name="parsedfield2">Java data</field>
        <field name="htmlcontent">Link to that HTML file</field>
      </doc>
    </add>
2. You need to modify schema.xml
- Each document should have a unique ID.
- For your needs, only htmlcontent (the path) needs to be stored.
- The other fields are indexed only, for search.
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="timestamp" type="text_general" indexed="true" stored="false" />
    <field name="parsedfield1" type="text_general" indexed="true" stored="false" />
    <field name="parsedfield2" type="text_general" indexed="true" stored="false" />
    <field name="htmlcontent" type="text_general" indexed="true" stored="true" />
3. You can post all the XML files with post.jar, or you can use the SolrJ API if you need to do it from a program.
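If neither post.jar nor SolrJ fits, the same XML can be built and sent with a short script. The following is only a sketch using the Python standard library: `build_add_xml` and `post_to_solr` are hypothetical helper names, and the update URL is an assumption that depends on your Solr version and core name.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs: list) -> bytes:
    """Serialize a list of dicts into Solr's XML update format:
    <add><doc><field name="...">value</field>...</doc>...</add>"""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > for us
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(update_url: str, payload: bytes) -> int:
    """POST the XML payload to a Solr update handler and return the HTTP status.
    update_url is an assumption, for example
    http://localhost:8983/solr/mycore/update?commit=true"""
    request = urllib.request.Request(
        update_url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    docs = [{
        "id": "01",
        "timestamp": "AWL",
        "parsedfield1": "Your data 1",
        "parsedfield2": "Java data",
        "htmlcontent": "Link to that HTML file",
    }]
    print(build_add_xml(docs).decode("utf-8"))
```

Only `build_add_xml` runs in the example at the bottom; calling `post_to_solr` requires a running Solr instance whose schema defines these fields.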
** Whether to store a field or not ** Do not store a field unless you want to display it in the results. If you only want to search on it, index it without storing it.