Adding OCR info to a PDF

I have a good-quality scan of a document, in PDF format.

How can I add OCR information to the PDF so that it becomes searchable? By searchable I mean that when I view the PDF with Evince, Ctrl+F actually lets me search the PDF's content.

edited Jun 7 '12 at 10:19

asked Jun 7 '12 at 8:56

fdierre


Duplicate? askubuntu.com/questions/16268/…

– Jakob
Jun 7 '12 at 9:04

4

@Jakob, I don't think it's a dupe; we are asking different things. The other question is about extracting text from a PDF (i.e. generating corresponding txt files), while my question is about modifying the PDF to add OCR information so that the search function works in the PDF reader. I'll clarify the question.

– fdierre
Jun 7 '12 at 10:17

How, and what did you use to scan the document?

– Mitch♦
Jun 7 '12 at 11:05

@Mitch I used my office Ricoh Aficio MP-C2500 printer/copier/scanner, which has a very nice document feeder. :-)

– fdierre
Jun 7 '12 at 12:06

Scanning and/or OCR Software?

– Mitch♦
Jun 7 '12 at 12:18

pdf scanning ocr


6 Answers

pdfsandwich

It does what you want and provides Ubuntu deb packages. It uses tesseract as its OCR engine. The following call adds a text layer to your scanned PDF:

pdfsandwich scanned.pdf

The following does the same, but for another language (given as an ISO 639-2 code; install the matching tesseract-ocr-LANGCODE package) and with an explicit layout setting:

pdfsandwich -verbose -lang spa -layout single scanned.pdf

If you get an error, please download the latest deb from SourceForge.
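For a whole folder of scans, the call above can be wrapped in a small loop. This is only a minimal sketch: the helper name ocr_dir is made up for illustration, and it assumes pdfsandwich writes its result next to the input as NAME_ocr.pdf (which I believe is its default when no output file is given).

```shell
#!/bin/sh
# ocr_dir DIR [LANG]: run pdfsandwich on every PDF in DIR (hypothetical helper).
# Assumption: pdfsandwich names its output <input>_ocr.pdf by default.
ocr_dir() {
    dir=$1
    lang=${2:-eng}                     # tesseract language code, e.g. eng, spa
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # the glob matched nothing
        case $pdf in
            *_ocr.pdf) continue ;;     # skip results of earlier runs
        esac
        # skip inputs that already have an OCRed counterpart
        [ -e "${pdf%.pdf}_ocr.pdf" ] && continue
        pdfsandwich -lang "$lang" "$pdf" || echo "failed: $pdf" >&2
    done
}
```

Something like `ocr_dir ~/scans spa` would then process every PDF in ~/scans that has not been OCRed yet.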

Disclaimer: I'm the developer of pdfsandwich and therefore obviously biased.

edited Mar 10 '17 at 4:03

Pablo Bianchi


answered Jul 25 '14 at 13:27

Tobias Elze


6

This is really great, thank you. However, it appears to modify the images; it looks like it runs an unsharp mask over them or something. Is there a way to leave the images exactly as they were? In my particular instance, the filter even managed to remove the bar from a couple of fractions in some equations. Everything else works pretty well though...

– naught101
Feb 9 '15 at 2:47

Bad quality package. Lintian check results for /tmp/pdfsandwich_0.1.3_amd64.deb:
E: pdfsandwich: control-file-has-bad-permissions md5sums 0664 != 0644
E: pdfsandwich: control-file-has-bad-owner md5sums james/james != root/root
E: pdfsandwich: wrong-file-owner-uid-or-gid usr/ 1000/1000
E: pdfsandwich: wrong-file-owner-uid-or-gid usr/bin/ 1000/1000
E: pdfsandwich: wrong-file-owner-uid-or-gid usr/bin/pdfsandwich ...

– A.B.
Apr 22 '15 at 5:55

Download the latest deb from SourceForge. If you get an error at the end, it might be related to ghostscript (v0.1.4); v0.1.6 now uses pdfunite.

– Pablo Bianchi
Mar 9 '17 at 21:46

1

@PabloBianchi Is there any way to manually proofread the OCRed text when using pdfsandwich? I'm doing this with some Swedish documents, and it works well, except for some misspellings (probably because of the original's font) which would be easy to fix if it were a text file. How can I fix them in the resulting PDF?

– zrajm
Jun 20 '17 at 15:44

@zrajm You can use some of pdfsandwich's parameters for better recognition in the OCR step. To edit the hidden text behind the image in a PDF, you can edit the text-box layer with LibreOffice Draw, Inkscape or any PDF editing tool. If you find a better way, please post it here. DaH jImej!

– Pablo Bianchi
Jun 21 '17 at 18:40


There are two projects which do the trick: GScan2PDF and OCRFeeder.

edited Feb 19 '13 at 10:02

Ashwin Nanjappa


answered Jun 7 '12 at 21:24

Aldi


I found a non-ideal solution, but a very effective one.

I use PDF X-Change Viewer through Wine. It has an OCR feature which adds a text layer to the existing image-based PDF.

Thus you can search and copy text from this invisible layer.
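Whichever tool adds the layer, one way to confirm from a terminal that it is really there is to extract the text and grep it. A minimal sketch, assuming pdftotext from poppler-utils is installed; has_text_layer and find_in_pdf are illustrative names, not part of any tool.

```shell
#!/bin/sh
# has_text_layer FILE: succeed if pdftotext extracts any non-blank text.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# find_in_pdf WORD FILE: print the extracted lines containing WORD,
# ignoring case (roughly what a viewer's search would find).
find_in_pdf() {
    pdftotext "$2" - 2>/dev/null | grep -i -- "$1"
}
```

For example, `has_text_layer scanned.pdf && echo searchable` should print nothing for a pure image scan and "searchable" once OCR text has been added.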


answered Feb 19 '13 at 10:31

To Do


For a command-line solution, you can use pdfocr.

In brief, install the software:

$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:gezakovacs/pdfocr
$ sudo apt-get update
$ sudo apt-get install pdfocr

Then run pdfocr:

$ pdfocr -i scanned.pdf -o scanned.with.search.pdf

That worked for me on Ubuntu 12.04 LTS.

answered Mar 23 '14 at 20:23

Robert Citek


6

GitHub repository: github.com/gkovacs/pdfocr. Note that this has the same issue as pdfsandwich: it modifies/compresses PDFs containing high-resolution images, destroying some of the original image information.

– jmiserez
Mar 21 '15 at 18:31


OCRmyPDF is easy to set up and produces an output PDF of the same quality as the input file, at a reasonable size:

https://github.com/jbarlow83/OCRmyPDF
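A typical call might look like the sketch below. It assumes the ocrmypdf command is installed (on recent Ubuntu releases it may be available as the ocrmypdf package, but check your release); the wrapper name ocr_keep_images is made up for illustration. The -l and --skip-text options are ones I am fairly sure ocrmypdf provides; the project's documentation is the authority.

```shell
#!/bin/sh
# ocr_keep_images SRC DST [LANG]: hypothetical wrapper around ocrmypdf.
# -l selects the tesseract language; --skip-text leaves pages that
# already contain text untouched instead of aborting on them.
ocr_keep_images() {
    src=$1
    dst=$2
    lang=${3:-eng}
    ocrmypdf -l "$lang" --skip-text "$src" "$dst"
}
```

Used as `ocr_keep_images scanned.pdf scanned-ocr.pdf spa`, this writes a new file and leaves the original scan as it was.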

answered Nov 8 '17 at 16:47

user127022


I have had excellent results with your script. Unlike pdfocr by Geza Kovacs, it does not require any extra (hard to compile in some Linux distros!) libraries. Thank you!

– Maxim
May 3 '18 at 15:04


This is my quick-and-dirty solution based on ImageMagick's convert, tesseract, parallel and pdftk (all available on Debian-based distributions). It's largely based on this blog post.

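#!/bin/sh -ex
# Usage: pass the input PDF as 1st parameter and, optionally, the DPI as 2nd
density=${2:-"300"} # default to 300 DPI if 2nd parameter is not given

# Rasterize every page to a monochrome, LZW-compressed TIFF
convert -monitor -density "$density" "$1" -monochrome -compress lzw -alpha deactivate page_%05d.tif
# OCR the pages in parallel; tesseract writes one searchable PDF per page
parallel --bar "tesseract {} {.} pdf 2>/dev/null" ::: page_*.tif
# Merge the per-page PDFs into a single compressed output file
pdftk page_*.pdf cat output "${1%.*}-ocred.pdf" compress

# Cleanup temp files
rm page_?????.tif page_?????.pdf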
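Since the script depends on four external programs, it can help to fail early with a clear message when one is missing. The following is a minimal sketch, not part of the original script; require_cmds and ocred_name are hypothetical helper names, and ocred_name merely reproduces the output naming used above.

```shell
#!/bin/sh
# require_cmds NAME...: report every missing command; fail if any is missing.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# ocred_name FILE: FILE without its last extension, plus "-ocred.pdf".
ocred_name() {
    printf '%s\n' "${1%.*}-ocred.pdf"
}
```

Calling `require_cmds convert tesseract parallel pdftk || exit 1` near the top of the script would stop it before any temporary files are created.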

answered 7 hours ago

stefanct



