Pantomime is a Clojure interface to Apache Tika.

Changes between Pantomime 2.4.0 and 2.5.0

Content Extraction API

Pantomime now provides access to Tika’s content extraction functionality via pantomime.extract/parse:

(require '[clojure.java.io :as io]
         '[clojure.pprint :refer [pprint]]
         '[pantomime.extract :as extract])

(pprint (extract/parse "test/resources/pdf/qrl.pdf"))

;= {:producer ("GNU Ghostscript 7.05"),
;=  :pdf:pdfversion ("1.2"),
;=  :dc:title ("main.dvi"),
;=  :dc:format ("application/pdf; version=1.2"),
;=  :xmp:creatortool ("dvips(k) 5.86 Copyright 1999 Radical Eye Software"),
;=  :pdf:encrypted ("false"),
;=  ...
;=  :text "\nQuickly Reacquirable Locks∗\n\nDave Dice Mark Moir ... "
;= }
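Every metadata value in the output above is a sequence of strings, even when there is only one value. The following is a minimal sketch of a helper for reading single-valued fields. The name first-value is hypothetical and not part of Pantomime, and the sample map is abbreviated from the output shown above.

```clojure
;; Sample result, abbreviated from the output above (metadata keys only).
(def result
  {:dc:title      '("main.dvi")
   :pdf:encrypted '("false")})

(defn first-value
  "Hypothetical helper: returns the first value stored under metadata
  key k in the result map m, or nil when the key is absent."
  [m k]
  (first (get m k)))

(first-value result :dc:title)  ;; "main.dvi"
(first-value result :missing)   ;; nil
```

This helper is intended for metadata keys only. The :text entry is a plain string, so calling first on it would return a character rather than the text.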

If extraction fails, extract/parse returns the following:

{:text "",
 :content-type ("application/octet-stream"),
 :x-parsed-by ("org.apache.tika.parser.EmptyParser")}
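Going by the failure map shown above, one way to detect a failed extraction is to look for the EmptyParser class name under :x-parsed-by. The sketch below assumes that this marker is a sufficient signal. The function name extraction-failed? is hypothetical and not part of Pantomime.

```clojure
(def empty-parser
  "Class name reported under :x-parsed-by in the failure map above."
  "org.apache.tika.parser.EmptyParser")

(defn extraction-failed?
  "Hypothetical helper: true when the result map lists EmptyParser
  among its :x-parsed-by values, false otherwise."
  [result]
  (boolean (some #{empty-parser} (:x-parsed-by result))))

;; The failure map from above.
(def failed
  {:text ""
   :content-type '("application/octet-stream")
   :x-parsed-by  '("org.apache.tika.parser.EmptyParser")})

(extraction-failed? failed) ;; true
```

Checking :x-parsed-by rather than an empty :text avoids misclassifying documents that parse successfully but contain no text.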

extract/parse is a simple interface to Tika’s own Parser.parse method.

Contributed by Joshua Thayer.

Change Log

The Pantomime change log is available on GitHub.

