Wednesday, November 6, 2013

Removing dependency on jQuery

Recently I've been working on removing jQuery from pdf2htmlEX.js.

jQuery was initially introduced as a handy cross-browser JavaScript layer. Until recently, I hadn't realized that IE (>= 9) had already implemented most standard JavaScript APIs. I was still thinking in terms of attachEvent vs addEventListener.

A few days ago, when I was trying to optimize the JS code, I found that jQuery is really large (~90K minified), while all the other JS and CSS files are no bigger than ~10K in total. Also, lots of results show that jQuery is really slow.

Currently most jQuery functions have been replaced with standard JavaScript APIs, except for $.extend and $.ajax, which will be worked out soon. Another exception is Element.classList which is not implemented in IE9, but I've copied a JS snippet from PDF.js for that.
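To give an idea of what such a replacement can look like, here is a minimal sketch of a shallow stand-in for `$.extend` in plain JavaScript. The helper name `extend` and the config fields are made up for illustration, and this is not the code that ended up in pdf2htmlEX.js; it only sticks to language features that IE9 supports.

```javascript
// Hypothetical helper, not taken from pdf2htmlEX.js: a shallow stand-in
// for $.extend(target, source1, source2, ...).
function extend(target) {
  for (var i = 1; i < arguments.length; i++) {
    var source = arguments[i];
    if (source === null || source === undefined) continue; // skip missing sources
    for (var key in source) {
      // Copy own properties only, ignoring anything inherited from a prototype
      if (Object.prototype.hasOwnProperty.call(source, key)) {
        target[key] = source[key];
      }
    }
  }
  return target;
}

// Usage: later sources override earlier ones (field names are made up).
var defaults = { zoom: 1, lazy: true };
var config = extend({}, defaults, { zoom: 2 });
```

Passing a fresh `{}` as the target keeps `defaults` untouched. Unlike jQuery's version, this sketch does not support deep copying.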

I can indeed feel a boost in performance after the removal of jQuery, but the modification is likely to cause regressions, although I've checked each API I used. I've tested the code on Firefox/Chrome on Linux, and I'll test on others as well.

Target browsers:
- IE >= 9
- Recent versions of Firefox / Chrome / Safari

Please help test (the git version of) pdf2htmlEX with your browser and file bugs if any. Thanks!

Thursday, October 17, 2013

pdf2htmlEX v0.10 is out

pdf2htmlEX v0.10 is out, bringing experimental support for SVG images and Type 3 fonts.


* Lots of code cleaning
* Logo as loading indicator
* Add a logo
* Remove several CSS prefixes
* Background image optimization
* Support output background image in JPEG (--bg-format jpg)
* [Experimental] Support output background image in SVG (--bg-format svg)
* [Experimental] Support Type 3 fonts
* New options:
 - --font-format (same as --font-suffix, but without the leading dot)

* Deprecated options:

Thursday, September 26, 2013

pdf2htmlEX got a logo

I managed to craft a logo with Inkscape, which is basically an emblem of "<pdf>". Perhaps it is not of much use, but I hope that it can help visualize the concept.

The images are located in the logo/ folder; all of them are licensed under CC-BY 3.0.

Friday, September 20, 2013

Preliminary support for Type 3 fonts

I'm happy to announce that preliminary support for Type 3 fonts has been added to pdf2htmlEX. For now, 2 simple PDFs from PDF.js pass:

This is actually one of the features I have wanted to implement the most since the very beginning. Another one is generating background images in SVG, a preliminary version of which has also just been added.

Both features rely on CairoOutputDev from poppler, which in turn relies on cairo and freetype. It might actually be possible to eliminate the dependency on freetype, but I don't want to touch those files, in order to make it easier to merge upstream files in the future. Anyway, it seems that poppler itself depends on freetype, so it's no big deal.

To enable this feature, you need the latest source code from git. Add `-DENABLE_SVG=ON` to cmake, and `--process-type3=1` when running pdf2htmlEX.

The current idea is, for each Type 3 font, to dump each glyph into an SVG image and then combine them into a font with FontForge. It's actually inspired by FontCustom: I learned that FontForge can import SVG glyphs by reading the code of FontCustom.

Each glyph is drawn on a 100x100 canvas. Although SVG is for vector graphics, CairoOutputDev would thicken thin strokes (for printing purposes?), which might ruin the font. There are also cases where sampled raster images are stored in the SVG file, which is probably cairo's behaviour due to the limitations of SVG. In such cases, 100x100 might not be large enough for a font.

The size is defined as GLYPH_DUMP_EM_SIZE. I tried setting it to 1000, and indeed the quality for `issue3188.pdf` improved; but for some other PDF files, the values in the SVG files might be so large that FontForge would complain that they cannot be stored in 16-bit fields. Or maybe it is a limitation of TTF, and I'd better switch to another format.

However, due to the complexity of Type 3 fonts (each glyph is a mini-PDF), and especially of the font matrix, I don't have a perfect solution for every possible case. Right now let me just focus on `average` cases.
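The trade-off can be shown with a little arithmetic: enlarging the em size scales every coordinate by the same factor, so a value that was harmless on a 100-unit canvas may no longer fit. The sketch below assumes the 16-bit fields are signed (range -32768 to 32767); the function names and the sample coordinates are made up for illustration and are not from pdf2htmlEX or FontForge.

```javascript
// Hypothetical illustration; names and sample numbers are assumptions.
var INT16_MIN = -32768;
var INT16_MAX = 32767;

// Scale coordinates measured on a `fromEm` canvas to a `toEm` canvas.
function scaleCoords(coords, fromEm, toEm) {
  return coords.map(function (v) { return v * toEm / fromEm; });
}

// True when every coordinate fits into a signed 16-bit field after rounding.
function fitsInt16(coords) {
  return coords.every(function (v) {
    var r = Math.round(v);
    return r >= INT16_MIN && r <= INT16_MAX;
  });
}

// Made-up glyph coordinates on a 100-unit em; the last one lies far outside the em box.
var glyph = [12, -40, 3500];
var fitsAt100 = fitsInt16(glyph);                          // 3500 is within range
var fitsAt1000 = fitsInt16(scaleCoords(glyph, 100, 1000)); // 35000 exceeds 32767
```

With these numbers the glyph fits at an em size of 100 but not at 1000, which matches the kind of complaint described above.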

Wednesday, September 18, 2013

Preliminary SVG support

Preliminary SVG support has been implemented, powered by CairoOutputDev from poppler.

Since CairoOutputDev is not exposed by poppler, I have to maintain a copy of a few files inside pdf2htmlEX. Also, cairo and freetype are required for this feature. The feature can be enabled or disabled with the ENABLE_SVG cmake option.
A new option `--bg-format` has also been added, to specify the format of the background images. Currently only 'png' and 'svg' are supported.

(This is also a test for auto forwarding blog posts to the mailing list)

Sunday, September 15, 2013

pdf2htmlEX v0.9 released

pdf2htmlEX v0.9 is released. This version includes several bug fixes, and not many new features.


* Lazy loading of pages
* Show font names in debug messages
* License changed
 - Additional terms for usage in online services
 - Remove GPLv2
* Bug fixes:
 - --optimize-text
 - Always use Unicode encoding for fonts
 - space width
 - disable ligature in Firefox
 - Uninitialized memory for encoding
* New options:
* Deprecated/Removed options:

Features planned in v0.10
- Preliminary optimization for raster images
- Preliminary support for SVG background.

Sunday, September 1, 2013

Development Log

I've been quite busy since the last article, and this will go on for a while, so please forgive me if I'm not very responsive. Feel free to poke me by email if I have not replied to your message within 3 days.

I had planned to add image optimizations in v0.9. While there are APIs in poppler to detect the areas that changed, there are no such convenient APIs for outputting only part of the background. I still need to work on this; ideally I would avoid bringing more modified poppler code into pdf2htmlEX (and then having to maintain it).

Recently there have been pull requests about keyboard shortcuts and UI events. (Thank you!)
But they require injecting global handlers into the web pages, which might not be desirable for users who need nothing but the HTML pages. I'm now considering creating a new mode, say standalone mode, for users who intend to use the complete package produced by pdf2htmlEX. In this way we can add more UI features without affecting other users. I'm still thinking about the logic: whether this should be a new set of files (manifest and other files) or simply a switch in JS.

Friday, May 10, 2013

Removal of GPLv2

Thanks to John, who pointed out that the additional term is not compatible with GPLv2. Also, as I have checked again, FontForge is now released under GPLv3+, so now (most parts of) pdf2htmlEX is released under GPLv3 only (with additional terms).

Licence Changed

I've recently been changing the license of pdf2htmlEX, but most of you should not be worried.

As you know, pdf2htmlEX is released under GPLv2 or GPLv3, with a few files released under the MIT License. Unlike the AGPL, the GPL does not protect the source code when it is used in online services.

I don't think it is necessary to apply the AGPL, since a wise service provider should have realized that making their modifications public is to their own advantage, and indeed I've received several patches from service providers.

Unlike most GPL software, pdf2htmlEX is designed for service providers. I expect the common use to be customization for different services rather than redistribution. But I don't want it to end up with lots of wrappers without any feedback. So in order to let more people know about this technology, and to attract more feedback, I recently added a new term to the license:

If you want to use pdf2htmlEX (or your modified version) in your online service,
through which a user can provide one or more files through a computer network,
and view any part of the result produced by pdf2htmlEX (or your modified version) on the file(s) provided by the user, you should credit pdf2htmlEX with a proper link to in the page where the result is presented, or the homepage of your service, or a page directly accessible from the homepage of your service.
Any derivative works should also include this term.
For example, you should credit pdf2htmlEX if pdf2htmlEX (or your modified version) must be called after a user uploads some files and before the user can see the result for those files.

Here are a few explanations:
This term applies if your service involves "online conversion", which means the service allows users to upload files and view the conversion produced by pdf2htmlEX (or your modification). This term does not apply if you convert documents of your own and present them online, but you are still encouraged to credit pdf2htmlEX.

Three locations are mentioned in the term:
  1. The page where you present the converted document
  2. The homepage
  3. A page directly accessible from the homepage
The first two should be intuitive. But if I were the UI designer, I might want a clean homepage, and I would not want to see the ugly logos of those document-embedding plugins; therefore I added the third one. I expect it to be the About page, the Acknowledgement page, or a page where you list the technologies used in your service.

If you have done this, you are encouraged to send me the name of your service and a URL to where pdf2htmlEX is credited. This is for statistical purposes, and in the future I may create a list of 'sites that use pdf2htmlEX'.

I'm not a lawyer, and I don't know how this is achieved in other software. I just want to express my thoughts in this term, which I hope are clearly explained in this post.

Please tell me what you think about this additional term!

Monday, May 6, 2013

pdf2htmlEX v0.8.1 is out

Download here

If you download and install this version, `pdf2htmlEX -v` actually shows v0.9. This is due to my naive git branch model. Please contact me if this is too annoying for you, and I'll fix it.

This is a quick fix for v0.8; the only other change is that `--optimize-text` is now turned off by default. This parameter turns out to still be buggy, and I'll try to fix it in the next release.

The next release will focus on optimization, mainly of background images. The idea is to use a number of rectangles to cover the occupied areas, instead of one big image for each page. The second step would be to combine the rectangles together and use something like CSS sprites or CSS clipping in order to reduce the number of requests.
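As a rough illustration of the first step, the sketch below reduces a set of occupied boxes to a smaller set of covering rectangles by merging any two that overlap or touch. It is only one possible approach, written here in JavaScript for brevity; the `{x, y, w, h}` representation, the function names and the sample boxes are all assumptions, not code from pdf2htmlEX.

```javascript
// Hypothetical sketch; rectangles are {x, y, w, h} objects in page coordinates.
function overlaps(a, b) {
  // True when the two rectangles overlap or share an edge
  return a.x <= b.x + b.w && b.x <= a.x + a.w &&
         a.y <= b.y + b.h && b.y <= a.y + a.h;
}

// Smallest rectangle containing both a and b.
function union(a, b) {
  var x = Math.min(a.x, b.x);
  var y = Math.min(a.y, b.y);
  return {
    x: x,
    y: y,
    w: Math.max(a.x + a.w, b.x + b.w) - x,
    h: Math.max(a.y + a.h, b.y + b.h) - y
  };
}

// Repeatedly merge overlapping rectangles until no pair overlaps any more.
function mergeRects(rects) {
  var result = rects.slice();
  var merged = true;
  while (merged) {
    merged = false;
    for (var i = 0; i < result.length && !merged; i++) {
      for (var j = i + 1; j < result.length && !merged; j++) {
        if (overlaps(result[i], result[j])) {
          var u = union(result[i], result[j]);
          result.splice(j, 1);    // remove j first (j > i), so index i stays valid
          result.splice(i, 1, u); // replace i with the merged rectangle
          merged = true;
        }
      }
    }
  }
  return result;
}

// Made-up occupied areas: the first two overlap, the third is far away.
var covered = mergeRects([
  { x: 0, y: 0, w: 50, h: 20 },
  { x: 40, y: 10, w: 30, h: 30 },
  { x: 200, y: 300, w: 10, h: 10 }
]);
```

Here the first two boxes collapse into one 70x40 rectangle and the distant one is kept separately, so only two small images would be needed instead of a full-page background. The pairwise search is quadratic, which should be acceptable for the small number of boxes on a typical page but is a simplification.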


Sunday, May 5, 2013

pdf2htmlEX hits Top Trending Repos

As of May 6th, 2013, pdf2htmlEX hits the Monthly Trending Repos at GitHub:
It has also become the top daily & weekly trending repo. Didn't see this coming!

I realized that it might be necessary to start a blog to share news and technical and non-technical stuff about pdf2htmlEX. And here it is.

pdf2htmlEX, just as its name suggests, converts PDF into HTML. How does it work? Let the demos speak:

Many people wonder why they should ever convert PDF to HTML. The short answer is that they should not, because they are viewers, while this tool is designed for publishers.

This is the era of the Web. For many people, the Internet = the World Wide Web. When not at work, I rarely let my screen be occupied by any window other than a browser. What can you not do with a browser? I like web pages: they have become more and more elegant, yet powerful (to use) and simple (to compose).

Despite the development of HTML/CSS/JavaScript, what is your experience with reading PDF files online? PDF is always the first choice for anything involving printing, not to mention for LaTeX users. But when you put 'online' after it, I'd say terrible. Years ago, online PDF reading meant ugly, insecure, unstable and slow plugins that never released my keyboard and mouse focus. And now browsers have started to implement their own built-in PDF viewers: PDF is so popular, and the plugins are so bad, that web browsers have to do this to comfort users.

Another thing I like in web pages, but not in PDF files, is interaction. A quick example is links: on Wikipedia, you get a rather smooth flow of information while your cursor dances among the links. Not to mention all kinds of CSS/JavaScript tricks that amaze you. The key is that everything is accessible. PDF, on the other hand, is more like a black box, or an <iframe>: it does have many features, but you (the hosting web page or the browser) never know what's going on inside.

This is not a fair comparison, since PDF was never designed for this. But the idea is that web technologies are powerful enough to render PDF files, and people need this (see Crocodoc and SlideShare). pdf2htmlEX works as a bridge, and the goal is to turn 'Everything to PDF' into 'Everything to Web'. Just imagine:

  • Your carefully designed resume can be published online with Google Analytics embedded.
  • Your slides can be shown online with all kinds of CSS/JavaScript eye candies.
  • PDF documents never make your web sites ugly.

Hopefully some day in the future, we will not be able to tell HTML from PDF by their appearances, just like we cannot tell JPEG from PNG. (or can you?)