TL;DR

Pantomime is a Clojure interface to Apache Tika.

Changes between Pantomime 2.4.0 and 2.5.0

Content Extraction API

Pantomime now provdes access to Tika’s content extraction functionality via pantomime.extract/parse:

(require [clojure.java.io :as io]
         [pantomime.extract :as extract])

(pprint (extract/parse "test/resources/pdf/qrl.pdf"))

;= {:producer ("GNU Ghostscript 7.05"),
;=  :pdf:pdfversion ("1.2"),
;=  :dc:title ("main.dvi"),
;=  :dc:format ("application/pdf; version=1.2"),
;=  :xmp:creatortool ("dvips(k) 5.86 Copyright 1999 Radical Eye Software"),
;=  :pdf:encrypted ("false"),
;=  ...
;=  :text "\nQuickly Reacquirable Locks∗\n\nDave Dice Mark Moir ... "
;= }

If extraction fails, extract.parse will return the following:

{:text "",
 :content-type ("application/octet-stream"),
 :x-parsed-by ("org.apache.tika.parser.EmptyParser")}

extract/parse is a simple interface to Tika’s own Parser.parse method.

Contributed by Joshua Thayer.

Change Log

Pantomime change log is available on GitHub.

Pantomime is a ClojureWerkz Project

Pantomime is part of the group of libraries known as ClojureWerkz, together with

  • Langohr, a Clojure client for RabbitMQ that embraces the AMQP 0.9.1 model
  • Elastisch, a minimalistic Clojure client for ElasticSearch
  • Cassaforte, a Clojure Cassandra client built around CQL
  • Monger, a Clojure MongoDB client for a more civilized age
  • Welle, a Riak client with batteries included
  • Neocons, a client for the Neo4J REST API
  • Quartzite, a powerful scheduling library

and several others. If you like Pantomime, you may also like our other projects.

Let us know what you think on Twitter or on the Clojure mailing list.

Michael on behalf of the ClojureWerkz Team