tommorris.org

Discussing software, the web, politics, sexuality and the unending supply of human stupidity.


pdf


Towards an Evernote replacement

Since the recent announcements by Evernote that they really, really will be able to poke around inside your notebooks without issue, and they’ll also apply the same sort of machine learning technology to your data that people have convinced themselves that paying for a product will help them avoid, lots of people have been looking at alternatives to Evernote. I’ve been evaluating a few of them.

Here are some of the open source alternatives I’ve seen people talk about:

  • Laverna, a client-side only JavaScript app that uses localStorage for your notes. No syncing, sadly.
  • Paperwork, a PHP server-side app
  • Standard Notes, which aims to be an encrypted protocol for storing Evernote-style notes

These all handle the plain text (or Markdown or whatever) case reasonably well, but there are a few things Evernote provides which we should be aware of if we’re trying to find replacements.

  1. text/plain or RTF storage. A lot of people store a lot of simple text notes in Evernote.
  2. OCRed PDF storage. Evernote has an app called Scannable that makes it ludicrously easy to scan a lot of documents and store them in Evernote.
  3. Web Clipper: I don’t use this, but a lot of people use Evernote as a kind of bookmarking service using the Web Clipper plugin that they provide. If they see a news article or recipe or whatever on the web, they clip it out and store it in Evernote and use that almost like a reading list, much as people use Instapaper or Pocket.

The solutions people have been building generally solve problem (1) but do little to address problems (2) and (3).

My own preferred solutions are basically this: for (1), I’m leaning towards just storing and syncing together plain text Markdown files of some description.

Solving (2) is a harder problem. My current plan is to try and create a way to store all these in email. Email is a pretty reliable, well-tested and widely implemented Everything Bucket. The process would be relatively simple: scan the document, run it through an OCR process, then provide the relevant title and tags, which could be stored in the subject line and body of the email. The OCR result would also be stored in the body of the email to make it more easily searchable. Then you just stick it all in an email folder (or Gmail label). You’ve got a security layer (whatever your email provides, and if you are storing lots of important data in there, you should probably ensure it is protected by 2FA). You’ve got sync (IMAP). You’ve got universal capture (email it to yourself). And you have already made either the financial bargain (with, say, Fastmail) or the give-away-all-your-personal-information bargain (Gmail). Backing up IMAP is relatively trivial compared to backing up whatever weird binary blob format people come up with.
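To make the email idea concrete, here is a minimal Python sketch of the "wrap a scan in a message" step. It is an illustration under assumptions rather than a finished tool: the function name `build_scan_email`, the subject-line layout and the example address are all made up, and the OCR text is assumed to have been produced already by whatever OCR tool you prefer.

```python
from email.message import EmailMessage
from email.utils import formatdate


def build_scan_email(title, tags, ocr_text, pdf_bytes, address):
    """Wrap one scanned document in an email message (hypothetical layout)."""
    msg = EmailMessage()
    msg["From"] = address
    msg["To"] = address  # universal capture: email it to yourself
    msg["Date"] = formatdate(localtime=True)
    # Title and tags go in the subject line so any mail client can search them.
    msg["Subject"] = f"{title} [{', '.join(tags)}]"
    # Tags and OCR output go in the body to make full-text search work.
    msg.set_content("Tags: " + ", ".join(tags) + "\n\n" + ocr_text)
    # The original scan rides along as an ordinary PDF attachment.
    msg.add_attachment(
        pdf_bytes,
        maintype="application",
        subtype="pdf",
        filename=title.replace(" ", "_") + ".pdf",
    )
    return msg
```

From there, one option would be to drop the message straight into a folder with the standard library's `imaplib`, something like `conn.append("Scans", None, None, msg.as_bytes())` on a logged-in `IMAP4_SSL` connection, where the folder name "Scans" is again only an example.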

Solving (3) is somebody else’s problem because I don’t understand why anyone wants to stick all the websites they’ve ever visited into Evernote.

That said, let’s not promise users a replacement for software if we are only replicating the features from that software that we actually use. If anyone has great suggestions for how they are going to sort out problem (2), I’m all ears.


Conference on hypertext asks for submissions in PDF only

HyperText 2015 (bold mine):

The ACM Conference on Hypertext and Social Media (HT) is a premium venue for high quality peer-reviewed research on theory, systems and applications for hypertext and social media. It is concerned with all aspects of modern hypertext research, including social media, adaptation and personalisation, user modeling, linked data and semantic web, dynamic and computed hypertext, and its application in digital humanities.

HT2015 will focus on the role of hypertext and hyperlink theory on the web and beyond, as a foundation for approaches and practices in the wider community.

Submission Instructions for HyperText 2015:

All submissions should be formatted according to the official ACM SIG proceedings template and submitted in PDF format

So much lack of self-awareness.


Unprintable PDFs won't save the planet

I was doing some new page patrolling on Wikipedia today, and I came across this article on the WWF file format.

In amongst all the Australian rules football players, Bollywood movies and obscure jazz albums that one goes through when new page patrolling, this stuck out.

It’s a file format designed by the World Wide Fund for Nature (what used to be the World Wildlife Fund) as an environmentally-conscious document format.

Wha—? An environmentally-conscious document format? That’s an object and a property that you don’t see going together if you are sane. It makes about as much sense as saying your yoghurt is low voltage or your toothpaste is skintight.

An environmentally-conscious document is one that actively prevents you from printing it. It turns out a .WWF file is just a PDF file with the “don’t print” security flag toggled on.

Which is all well and good, but what if you want to print the document? PDF security flags are somewhere between one of those disable-right-click scripts you find on web pages from beautiful and unique snowflake artist types (and a few porn sites) who don’t want their precious JPEGs stolen and rubbish DRM systems, which is to say most of them. A damn security flag will not stop anyone, especially as almost all PDF viewer applications except the default Adobe ones have been built to ignore them completely.
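To show how little is going on here: in the standard PDF security handler, the permissions are a single integer (the /P entry of the encryption dictionary), and printing is bit 3, counting from 1 at the low-order end. The sketch below is only an illustration. The function names are hypothetical, and the value -3904 is made up as an example of a permissions integer with the print bit cleared, not taken from a real .WWF file.

```python
PRINT_BIT = 1 << 2  # bit 3 of /P, counting from 1: "print the document"


def printing_allowed(permissions):
    """True if the print bit is set in a PDF /P permissions integer."""
    return bool(permissions & PRINT_BIT)


def viewer_will_print(permissions, honours_flags):
    """Model a viewer: the flag only matters if the viewer chooses to check it."""
    if not honours_flags:
        return True  # a viewer that ignores the flags prints regardless
    return printing_allowed(permissions)


no_print = -3904  # hypothetical /P value with bit 3 cleared
print(printing_allowed(no_print))          # False
print(viewer_will_print(no_print, False))  # True
```

The document is only as unprintable as the viewer is polite. Note that a tool which genuinely removes the restriction would normally re-save the file without its encryption dictionary rather than flip the bit in place, since the permissions value also feeds into the encryption key.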

This format is completely pointless. You wouldn’t believe the number of people I see who encounter a file type they don’t understand and instead of showing some initiative and finding an application that might be able to open it just give up. There’s some special pill they give computer users these days which makes them have absolutely no initiative at all and act like damsels in distress. When someone downloads a ‘WWF’ file, they are just going to say “what the fuck is this and why isn’t Adobesoft Crapware 3000 opening it for me?”

Or if they are technically competent, they’ll go and read the Wikipedia page and see “ah, it’s a PDF file”. Then they’ll open it up in an open source PDF reader, remove the security flags and save it as a real PDF file with a “.pdf” extension and use that instead. And they’ll probably print fifty copies of the document just out of spite (or so that they can force the inventor of this idiotic idea to eat a few dozen sheets of laserprint if they are ever unlucky enough to meet them).

So this is confusing for non-technical people and annoying for technical people. And completely ineffective at the stated goal of getting people to print fewer things out.

What does it do? It shows the WWF as a bunch of technically inept wackaloons and dramatically reduces the likelihood that anyone who encounters this silly idea will ever consider contributing to the WWF, even if they are an environmentally-concerned hippie (hint: a lot of geeks are).

Here’s one way to help: stop printing newsletters. Ask your members if they are willing to put up with getting an e-mail newsletter instead of a printed magazine. Get other non-profits and companies to do likewise. Maybe set up a service to manage e-mail newsletter subscriptions for other non-profits.

Turning yourself into a bad joke amongst geeks is no way to save the environment.

If you are a charity or campaigning group and are planning something like this, find someone with some knowledge of computing and send them an e-mail first.


Connecting to GoodReader as a network folder is super handy. I’ve now set up GoodReader on my iPad to sync with my computer using this method to mount it, then using Unison to sync. This is all controlled using a Rake script. Here’s the code.


Splitting PDFs vertically to turn double-page into single-page

I've spent a long time looking for a script to turn double page scanned PDFs into single page PDFs, specifically so I can read them on my iPod Touch using the excellent GoodReader.

I sat down to write a question about this on Super User. But in the process of doing so, I managed to formulate the perfect Google query - 'split double page pdf' (no quotes). I downloaded this Perl script, installed the PDF::API2 CPAN module and it just worked as expected.

Before finding this, I investigated a whole load of commercial apps, including Acrobat and a whole bunch of $49 Windows apps promoted by having stock photos of efficient-looking business guys in suits. And because that whole area has a bad whiff of conman tactics about it, I started investigating using iText, the Java PDF manipulation library, to write my own. A thirty-line Perl script does the job much better though.
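The geometry behind this kind of split is simple enough to sketch. The Python below is not the Perl script mentioned above, just a hypothetical illustration of the usual approach: take a double-page box and cut it down the middle into a left box and a right box, which a PDF library could then apply as crop boxes to two copies of the page. The function name and the (x0, y0, x1, y1) tuple layout are assumptions.

```python
def split_double_page(box):
    """Return (left, right) boxes for a double-page (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    mid = (x0 + x1) / 2  # the vertical cut runs down the middle of the spread
    left = (x0, y0, mid, y1)
    right = (mid, y0, x1, y1)
    return left, right


# A 1000 x 700 point spread becomes two 500 x 700 pages.
left, right = split_double_page((0, 0, 1000, 700))
print(left)   # (0, 0, 500.0, 700)
print(right)  # (500.0, 0, 1000, 700)
```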

There is only one slight downside I've found - it does rather inflate the file sizes. I had a 17MB PDF file, which I then split using the Perl script. It's now 36MB. It doesn't bother me too much, but it seems like it might be the case that the cropped data is only being hidden away rather than actually removed from the file, then duplicated on the following page. I also tend to have to open the finished file in Preview.app and remove the odd page or two. I am tempted to modify the Perl script to include some 'ignore' pages. Basically, it would need to do a quick statistical run-through of the pages beforehand and check their sizes - so the odd few pages at the start which are funnily sized (title pages etc.) don't get treated the same way as the rest of the document.
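The 'ignore' pass could be as simple as finding the most common page size and flagging anything that strays too far from it. Here is a rough Python sketch of that idea. The function name `pages_to_skip`, the 5% tolerance and the sample sizes are all assumptions for illustration, and the page sizes would have to be read out of the PDF by whichever library does the splitting.

```python
from collections import Counter


def pages_to_skip(sizes, tolerance=0.05):
    """Return indexes of pages whose size differs from the typical page.

    sizes is a list of (width, height) pairs in points. The typical size
    is the most common one after rounding to whole points.
    """
    if not sizes:
        return []
    counts = Counter((round(w), round(h)) for w, h in sizes)
    (typical_w, typical_h), _ = counts.most_common(1)[0]
    skip = []
    for index, (w, h) in enumerate(sizes):
        too_wide_or_narrow = abs(w - typical_w) > tolerance * typical_w
        too_tall_or_short = abs(h - typical_h) > tolerance * typical_h
        if too_wide_or_narrow or too_tall_or_short:
            skip.append(index)  # leave this page unsplit
    return skip


# A single-width title page followed by double-page scans.
sizes = [(595, 842), (1190, 842), (1191, 842), (1190, 841), (1190, 842)]
print(pages_to_skip(sizes))  # [0]
```

Small scanning wobbles of a point or two stay inside the tolerance, so only the genuinely odd pages get left alone.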

Anyway, I can now read various scanned papers on my iPod, so that's all good. I hope others in similar predicaments can use this to make their PDFs readable on small devices like iPods.