hadoop - Index the raw HTML content using Solr/Lucene


I have some HTML that I scraped from the same site at different times. The raw data looks like this:

  timestamp, htmlcontent (500KB) ..   

I wrote a parser to extract a few fields of interest, and I'm trying to build a search engine on top of those parsed fields. Note that the raw column is not plain text; it is the complete HTML content.

Now my data looks like this:

  timestamp, htmlcontent, parsedfield1, parsedfield2

I want to let the user search by timestamp, parsedfield1, or parsedfield2; my search engine should match the user's query, return the corresponding raw HTML, and pop it up in the browser, so it feels like a search-engine time machine :) Given that, I'm wondering how I should index: which fields should be stored and which should not? I'm following the book "Lucene in Action" and was hoping someone could help me think through this problem.

From my understanding, schema.xml gives each field a couple of options: indexed or not, stored or not. My take is, "whatever you want returned in the query results has to be stored." In that case, I would have to store the column that contains the raw HTML...

That column is very large: a single record is typically a few hundred KB, so even with a modest number of rows the dataset easily reaches around 1 GB. I don't think that will work well in Solr, and when I try to index those columns with Lucene I run into problems because of the sheer size.

Here's another idea: maybe I index only parsedfield1, parsedfield2, and a pointer column, where the pointer is the full path to the raw HTML file. Of course, in this case I'd have to store every HTML document as a separate file, locally or in HDFS. Then, when the user searches on parsedfield1, the query returns the full path and I serve those files back.
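To make the pointer idea concrete, here is a minimal sketch, in Java with the standard Hadoop FileSystem API, of how I imagine reading one archived page back from HDFS once the index returns its path (the class name and path are just placeholders I made up):

  import java.io.ByteArrayOutputStream;
  import java.io.InputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HtmlFetcher {
      // Reads one archived page back from HDFS, given the full path that the
      // index returned in the pointer field.
      public static String fetch(String pathInHdfs) throws Exception {
          Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
          try (FileSystem fs = FileSystem.get(conf);
               InputStream in = fs.open(new Path(pathInHdfs))) {
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              byte[] buf = new byte[8192];
              int n;
              while ((n = in.read(buf)) != -1) {
                  out.write(buf, 0, n);
              }
              return out.toString("UTF-8");
          }
      }
  }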

I think that's as clearly as I can describe the problem. I'm hoping someone can give me some directional guidance...

Much appreciated!

Some guidelines:

1. You can feed your data to Solr in XML, CSV, or JSON format; I'll use XML here. For example, your data in XML format:

  <add>
    <doc>
      <field name="id">01</field>
      <field name="timestamp">your timestamp</field>
      <field name="parsedfield1">your data 1</field>
      <field name="parsedfield2">your data 2</field>
      <field name="htmlcontent">link to that HTML file</field>
    </doc>
  </add>

2. You need to modify schema.xml:

- Each document should have a unique id.
- For your needs, only htmlcontent (the path) needs to be stored.
- The other fields only need to be indexed, for searching.

  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="timestamp" type="text_general" indexed="true" stored="false" />
  <field name="parsedfield1" type="text_general" indexed="true" stored="false" />
  <field name="parsedfield2" type="text_general" indexed="true" stored="false" />
  <field name="htmlcontent" type="text_general" indexed="true" stored="true" />


3. You can post all the XML files with Solr's post.jar tool, or use the SolrJ API if you need to do it from a program.
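For the SolrJ route, a minimal indexing sketch could look like the following. It assumes SolrJ 4.x, where the client class is HttpSolrServer (newer versions use HttpSolrClient), and a core at http://localhost:8983/solr/collection1; all field values are placeholders.

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexPages {
      public static void main(String[] args) throws Exception {
          // Core URL is just an example; point it at your own Solr instance.
          SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "01");
          doc.addField("timestamp", "2013-05-01T10:00:00Z");             // placeholder value
          doc.addField("parsedfield1", "your data 1");
          doc.addField("parsedfield2", "your data 2");
          doc.addField("htmlcontent", "/archive/2013-05-01/page01.html"); // path, not the raw HTML

          solr.add(doc);       // send the document
          solr.commit();       // make it searchable
          solr.shutdown();
      }
  }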


**Whether to store a field or not:** store a field only if you want to display it in the results; if you only need to search on it, index it without storing it.
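As an illustration of that rule with the schema above (again only a sketch, with the same assumptions about the SolrJ version and core URL as before): a query on the index-only parsedfield1 can still return the stored htmlcontent path, which your application then uses to fetch and display the raw HTML file.

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class SearchPages {
      public static void main(String[] args) throws Exception {
          SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

          // Search on an indexed-only field and ask for the stored fields back.
          SolrQuery query = new SolrQuery("parsedfield1:something");
          query.setFields("id", "htmlcontent");

          QueryResponse response = solr.query(query);
          for (SolrDocument doc : response.getResults()) {
              // htmlcontent was stored, so its value (the file path) comes back;
              // parsedfield1/parsedfield2 were index-only and are not returned.
              System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("htmlcontent"));
          }
          solr.shutdown();
      }
  }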
