Extracting Text


Text extraction is possible for most PDF documents. There are, however, some limitations that depend on how a document's text is encoded and on the type of font used to render it.

Note
If a document is encrypted, the document permissions should be checked to make sure that content extraction is allowed.
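As background for the note above, the standard PDF security handler stores user permissions as a 32-bit flag value (the P entry of the encryption dictionary), where bit 5 (value 16) governs copying and extracting text and graphics. The following is a minimal sketch of that bit test only. The class and method names are hypothetical and are not part of the ICEpdf API; in an application, the library's own permission accessors should be preferred over reading the raw flag.

```java
public class ExtractionPermission {

    // Bit 5 of the PDF permission flags (1-based numbering, value 16):
    // copy or otherwise extract text and graphics.
    static final int COPY_EXTRACT_BIT = 1 << 4;

    // Returns true when the raw P value has the extraction bit set.
    // Hypothetical helper for illustration; not an ICEpdf method.
    static boolean isExtractionAllowed(int permissionFlags) {
        return (permissionFlags & COPY_EXTRACT_BIT) != 0;
    }

    public static void main(String[] args) {
        // -1 has every bit set, so all operations are permitted.
        System.out.println(isExtractionAllowed(-1));
        // -3904 (0xFFFFF0C0) clears bits 3 to 6, so extraction is denied.
        System.out.println(isExtractionAllowed(-3904));
        // -3888 (0xFFFFF0D0) sets only bit 5 among bits 3 to 6.
        System.out.println(isExtractionAllowed(-3888));
    }
}
```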

The following code demonstrates how to extract text from the first page of a PDF document. The call to document.getPageText(..) returns the PageText data structure, which contains child LineText, WordText and GlyphText objects. A call to pageText.toString() returns all the text for the current page.

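import java.io.FileOutputStream;
import java.net.URL;
import org.icepdf.core.pobjects.Document;
import org.icepdf.core.pobjects.graphics.text.PageText;

// declared outside the try block so it can be disposed in finally
Document document = new Document();
try {
   // load the file
   URL documentURL = new URL("your url");
   document.setUrl(documentURL);

   // extract the text of the first page (page indexes start at 0)
   PageText pageText = document.getPageText(0);

   // create an output file; try-with-resources closes the stream
   try (FileOutputStream fileOutputStream = new FileOutputStream("extracted.txt")) {
      if (pageText != null && pageText.getPageLines() != null) {
         fileOutputStream.write(pageText.toString().getBytes());
      }
   }
} catch (Throwable e) {
   e.printStackTrace();
} finally {
   // clean up the document resources
   document.dispose();
}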
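
The sample above writes the result of getBytes() with no argument, which encodes the text with the JVM's default charset and can therefore produce different bytes on different machines. The following is a minimal sketch of writing extracted text with an explicit UTF-8 encoding using only the standard library. The class and method names are hypothetical, and the string in main stands in for the value returned by pageText.toString().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileWriter {

    // Writes the text as UTF-8, creating the file or replacing its contents.
    // Hypothetical helper for illustration; not an ICEpdf method.
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("extracted", ".txt");
        // Placeholder for the extracted page text.
        String text = "R\u00e9sum\u00e9";
        writeUtf8(out, text);

        // Read the file back with the same charset to confirm a round trip.
        String readBack = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(text));
        Files.delete(out);
    }
}
```

Fixing the charset at the point of writing keeps the output file stable across platforms, at the cost of requiring readers of the file to also assume UTF-8.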




© Copyright 2017 ICEsoft Technologies Canada Corp.