c# - Extract entire text from PDF with iTextSharp


I am trying to parse existing PDFs to extract parameters that will be added to an existing database, but I am having a problem parsing the PDFs.

First attempt:

    // requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    string[] allPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
    foreach (var pdfDoc in allPdf)
    {
        using (PdfReader reader = new PdfReader(pdfDoc))
        {
            // iTextSharp page numbers start at 1
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
            }
        }
    }

Unfortunately, this only extracts the text that follows the titles (employer, website, language, etc.), not the titles themselves. I need the titles to build the pairs that will be mapped to a relation in the database.

Second attempt:

    // requires: using System.IO; using iTextSharp.text.io; using iTextSharp.text.pdf;
    string[] allPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
    foreach (var pdfDoc in allPdf)
    {
        using (PdfReader reader = new PdfReader(pdfDoc))
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // read the raw content stream of the page and tokenize it
                byte[] streamBytes = reader.GetPageContent(page);
                PRTokeniser tokenizer = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(streamBytes)));
                while (tokenizer.NextToken())
                {
                    if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
                    {
                        string text = tokenizer.StringValue;
                    }
                }
            }
        }
    }

Fortunately, this does extract the missing titles, but it returns all the titles first (with the words on separate lines instead of in a single row) and the values afterwards.

iTextSharp documentation?

iTextSharp should contain classes that can find the title/value pairs, or at least parse the titles in a readable format. I'm open to writing my own implementation of ITextExtractionStrategy.
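For illustration only, here is a minimal sketch of the row-grouping step that a custom ITextExtractionStrategy could apply once it has collected text chunks with their positions. `TextChunk` and `RowGrouper` are hypothetical names, not part of iTextSharp, and the code deliberately has no iTextSharp dependency. In a real strategy the chunks would presumably be filled in `RenderText(TextRenderInfo)` from `GetText()` and `GetBaseline().GetStartPoint()`. The 2-point tolerance is an assumption, not a value taken from the PDFs in question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk type: one piece of text plus the start of its baseline.
// In a custom ITextExtractionStrategy these values could be captured in
// RenderText(TextRenderInfo info) via info.GetText() and
// info.GetBaseline().GetStartPoint().
public sealed class TextChunk
{
    public float X { get; }
    public float Y { get; }
    public string Text { get; }

    public TextChunk(float x, float y, string text)
    {
        X = x;
        Y = y;
        Text = text;
    }
}

public static class RowGrouper
{
    // Groups chunks whose baselines lie within `tolerance` of the first chunk
    // of the current row. Rows are ordered top to bottom (PDF y grows upward),
    // and each row is read left to right.
    public static List<string> ToRows(IEnumerable<TextChunk> chunks, float tolerance = 2f)
    {
        var rows = new List<List<TextChunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var last = rows.LastOrDefault();
            if (last != null && Math.Abs(last[0].Y - chunk.Y) <= tolerance)
                last.Add(chunk);                           // same visual row
            else
                rows.Add(new List<TextChunk> { chunk });   // start a new row
        }

        return rows
            .Select(r => string.Join(" ", r.OrderBy(c => c.X).Select(c => c.Text.Trim())))
            .ToList();
    }
}
```

With chunks such as "Employer" at (50, 700.5), "Acme Ltd" at (200, 700), "Website" at (50, 680) and "example.com" at (200, 680), `RowGrouper.ToRows` would yield one row per title/value pair, so a title and its value end up on the same line regardless of the order in which they appear in the content stream.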

iTextSharp does not have an official documentation page, though some answers can be found elsewhere. Rather than extracting the data from the PDF directly, try converting it to XML and then use XPath to get the data you need. Alternatively, you can use LINQ to XML. I'm guessing that each page of the PDF has the same layout, so the XML structure should be the same for each page as well.
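As a rough sketch of the XPath and LINQ to XML idea, assuming the PDF content has already been converted to XML by some other tool: the `<page>`/`<field title="...">` layout and the `FieldReader` class below are hypothetical and only serve to show how title/value pairs could be read once such XML exists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class FieldReader
{
    // Assumed (hypothetical) layout:
    //   <page><field title="Employer">Acme Ltd</field> ... </page>

    // XPath variant: collect every <field> that has a title attribute.
    // ToDictionary throws if the same title occurs twice.
    public static Dictionary<string, string> ReadFields(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.XPathSelectElements("//field[@title]")
                  .ToDictionary(f => (string)f.Attribute("title"),
                                f => f.Value.Trim());
    }

    // LINQ to XML variant: look up a single value by its title;
    // returns null when no field has that title.
    public static string ReadField(string xml, string title)
    {
        return XDocument.Parse(xml)
                        .Descendants("field")
                        .Where(f => (string)f.Attribute("title") == title)
                        .Select(f => f.Value.Trim())
                        .FirstOrDefault();
    }
}
```

If every page really does share one layout, the resulting dictionary could be mapped to one database row per page, with the titles as column names.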

A sample project is available that does this with a paid SDK; if you need a free option, treat the approach above as a temporary solution.
