PDF Text Extraction using ICEPDF
Forum Index -> ICEpdf General
Author Message
DivyaKambhatla

Joined: 25/Jan/2011 02:42:33
Messages: 2
Offline


Hi,

I have attached a copy of a PDF from which I cannot extract any content except the watermark using ICEpdf. I am not sure whether the ICEpdf code given in the developer guide (Extract Text section) requires further customization or setting any variables.

I used the binary version of the 4.1.1 bundle, took the two jars (icepdf-core.jar and icepdf-viewer.jar) from the lib folder, and used JDK 1.6 to run the extract text method given in the developer guide, which extracts only the watermark.

 Attachment: samplepdf.pdf (550 Kbytes)

patrick.corless

Joined: 26/Oct/2004 00:00:00
Messages: 1982
Offline


Hello;

Thanks for posting the file. It turns out that our sample code for text extraction isn't working very well in this case. The example code calls document.getPageText(), which eventually calls page.getText(), which is supposed to be optimized for text extraction. However, in this case it is a little too optimized, in that it only parses the watermark text.

The good news is that there is another way to get a page's text, which is used by the viewer RI. In this case the page is fully parsed, which is a little slower than the previous approach (I'll create a bug for the text extraction error). In our ./examples/extraction/PageTextExtraction.java

replace

Code:
PageText pageText = document.getPageText(pagNumber);


with

Code:
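// Lock object passed to both getPage() and releasePage().
Object pageLock = new Object();
PageTree pageTree = document.getPageTree();
Page pg = pageTree.getPage(pagNumber, pageLock);
// Full page parse, the same path the viewer uses.
PageText pageText = pg.getViewText();
pageTree.releasePage(pg, pageLock);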


The extraction preserves the columns but always has problems with justified text layout, as it's really hard to get the spacing right. If this is a problem, there is a system property, org.icepdf.core.views.page.text.spaceFraction=3, which can be tweaked to help detect words. A value of zero does no space insertion, whereas a larger number will try to fractionally (based on the average glyph width) add more spaces.
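
To make the effect of spaceFraction easier to picture, here is a rough sketch in plain Java with no ICEpdf classes. It is not ICEpdf's actual code. It assumes a simple model in which a space is inserted whenever the gap between two glyphs exceeds the average glyph width divided by spaceFraction, which matches the description above (zero inserts nothing, larger values insert more spaces). The Glyph class, joinGlyphs() and sample() are hypothetical names made up for illustration; only the property name comes from the post. On a real run the property would more commonly be passed on the command line with -D, and setting it before any ICEpdf class is touched is probably the safe choice, although when ICEpdf reads it is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;

class SpaceFractionSketch {
    // Property name quoted in the post above.
    static final String PROP = "org.icepdf.core.views.page.text.spaceFraction";

    /** Hypothetical glyph: character, left edge and width, all in page units. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Assumed model, not ICEpdf's implementation: insert a space when the gap
     * between consecutive glyphs exceeds (average glyph width / spaceFraction).
     * A spaceFraction of zero disables space insertion entirely.
     */
    static String joinGlyphs(List<Glyph> glyphs, int spaceFraction) {
        if (glyphs.isEmpty()) {
            return "";
        }
        double total = 0;
        for (Glyph g : glyphs) {
            total += g.width;
        }
        double average = total / glyphs.size();
        double threshold = spaceFraction > 0
                ? average / spaceFraction
                : Double.POSITIVE_INFINITY;

        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            // Gap between the right edge of the previous glyph and this glyph.
            if (prev != null && g.x - (prev.x + prev.width) > threshold) {
                sb.append(' ');
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    /** Four glyphs of width 6 with a gap of 3 units between 'b' and 'c'. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('a', 0, 6));
        glyphs.add(new Glyph('b', 6, 6));
        glyphs.add(new Glyph('c', 15, 6));
        glyphs.add(new Glyph('d', 21, 6));
        return glyphs;
    }

    public static void main(String[] args) {
        // Programmatic equivalent of passing -D<property>=3 to the JVM.
        System.setProperty(PROP, "3");
        int fraction = Integer.getInteger(PROP, 0);

        System.out.println(joinGlyphs(sample(), fraction)); // gap 3 > 6/3, space inserted
        System.out.println(joinGlyphs(sample(), 0));        // insertion disabled
    }
}
```

Under this model a value of 1 would need a gap wider than a whole average glyph before inserting a space, so the sample comes out as one word, while 3 splits it in two. For justified text, where word gaps are stretched unevenly, trying a few values on a representative page is likely the practical way to pick one.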