Adding OCR info to a PDF
I have a good-quality scan of a document, in PDF format.
How can I add OCR information to the PDF so that it becomes searchable? By searchable I mean that when viewing the PDF with Evince, Ctrl+F actually lets me search the PDF content.
pdf scanning ocr
Duplicate? askubuntu.com/questions/16268/…
– Jakob
Jun 7 '12 at 9:04
@Jakob, I don't think it's a dupe; we are asking different things. The other question is about extracting text from a PDF (i.e. generating corresponding txt files), while my question is about modifying the PDF to add OCR information so that the search function works in the PDF reader. I'll clarify the question.
– fdierre
Jun 7 '12 at 10:17
How, and what did you use to scan the document?
– Mitch♦
Jun 7 '12 at 11:05
@Mitch I used my office Ricoh Aficio MP-C2500 printer/copier/scanner, which has a very nice document feeder. :-)
– fdierre
Jun 7 '12 at 12:06
Scanning and/or OCR Software?
– Mitch♦
Jun 7 '12 at 12:18
asked Jun 7 '12 at 8:56 by fdierre
edited Jun 7 '12 at 10:19 by fdierre
6 Answers
pdfsandwich
It does what you want and provides Ubuntu deb packages. It uses tesseract as its OCR engine. The following call adds a text layer to your scanned PDF:
pdfsandwich scanned.pdf
The following does the same, but with another language (an ISO 639-2 code; install the matching tesseract-ocr-LANGCODE package) and an explicit layout setting:
pdfsandwich -verbose -lang spa -layout single scanned.pdf
If you get an error, please download the latest deb from SourceForge.
Disclaimer: I'm the developer of pdfsandwich and therefore obviously biased.
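To process several scans in one go, the single-file call above can be wrapped in a loop. This is a minimal sketch, not part of pdfsandwich itself: the helper names sandwich_output_name and sandwich_all are made up for illustration, and the sketch assumes pdfsandwich's default output name of <input>_ocr.pdf, which is worth confirming against your installed version.

```shell
#!/bin/sh
# Sketch only: batch-run pdfsandwich over several scanned PDFs.

# Hypothetical helper: print the output name assumed for input "$1"
# (assumption: pdfsandwich writes <input>_ocr.pdf by default).
sandwich_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: OCR every PDF passed as an argument.
# Files that look like earlier outputs, or whose assumed output
# already exists, are skipped so the loop can be re-run safely.
sandwich_all() {
    for pdf in "$@"; do
        case $pdf in
            *_ocr.pdf) continue ;;
        esac
        out=$(sandwich_output_name "$pdf")
        if [ -e "$out" ]; then
            echo "skipping $pdf: $out already exists" >&2
            continue
        fi
        pdfsandwich "$pdf"
    done
}
```

For example, sandwich_all *.pdf would process every scan in the current directory.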
This is really great, thank you. However, it appears to modify the images; it looks like it runs an unsharp mask over them or something. Is there a way to leave the images exactly as they were? In my particular case, the filter even managed to remove the bar from a couple of fractions in some equations. Everything else works pretty well though...
– naught101
Feb 9 '15 at 2:47
Bad quality package: `Lintian check results for /tmp/pdfsandwich_0.1.3_amd64.deb: E: pdfsandwich: control-file-has-bad-permissions md5sums 0664 != 0644 E: pdfsandwich: control-file-has-bad-owner md5sums james/james != root/root E: pdfsandwich: wrong-file-owner-uid-or-gid usr/ 1000/1000 E: pdfsandwich: wrong-file-owner-uid-or-gid usr/bin/ 1000/1000 E: pdfsandwich: wrong-file-owner-uid-or-gid usr/bin/pdfsandwich ...
– A.B.
Apr 22 '15 at 5:55
Download the latest deb from SF. If you get an error at the end, it might be related to ghostscript (v0.1.4). v0.1.6 now uses pdfunite.
– Pablo Bianchi
Mar 9 '17 at 21:46
@PabloBianchi Is there any way to manually proofread the OCRed text when using pdfsandwich? I'm doing this with some Swedish documents, and it works well, except for some misspellings (probably because of the original's font) which would be easy to fix if it were a text file, but how can I do this in the resulting PDF?
– zrajm
Jun 20 '17 at 15:44
@zrajm You can use some of pdfsandwich's parameters to get better recognition in the OCR step. To edit the hidden text behind the image in a PDF, you can edit the text box layer with LibreOffice Draw, Inkscape or any PDF editing tool. If you find a better way, please post it here. DaH jImej!
– Pablo Bianchi
Jun 21 '17 at 18:40
There are two projects which do the trick: GScan2PDF and OCRFeeder.
I found a non-ideal solution, but a very effective one.
I use PDF X-Change Viewer through Wine. It has an OCR feature which adds a text layer to the existing image-based pdf.
Thus you can search and copy text from this invisible layer.
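Whether a PDF actually carries such a text layer can also be checked from the command line. A rough sketch, assuming pdftotext from poppler-utils is installed; the function names are hypothetical and only serve the illustration:

```shell
#!/bin/sh
# Sketch only: check whether a PDF contains extractable text.

# Hypothetical helper: succeed if stdin holds any non-whitespace character.
has_visible_text() {
    grep -q '[^[:space:]]'
}

# Hypothetical helper: succeed if pdftotext extracts any text from PDF "$1".
# The "-" argument makes pdftotext write to standard output instead of a .txt file.
pdf_has_text_layer() {
    pdftotext "$1" - 2>/dev/null | has_visible_text
}
```

For example, pdf_has_text_layer scanned.pdf && echo "searchable". An image-only scan should produce nothing but whitespace, so the check fails; it also fails if pdftotext itself is missing.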
For a command line solution, you can use pdfocr.
In brief, install the software:
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:gezakovacs/pdfocr
$ sudo apt-get update
$ sudo apt-get install pdfocr
Then run pdfocr:
$ pdfocr -i scanned.pdf -o scanned.with.search.pdf
That worked for me on Ubuntu 12.04 LTS.
GitHub here: github.com/gkovacs/pdfocr. But this has the same issue as pdfsandwich: it modifies/compresses PDFs containing high-res images, basically destroying some of the original image information.
– jmiserez
Mar 21 '15 at 18:31
A solution that is easy to use and produces an output PDF with the same quality as the input file, at a reasonable size, is OCRmyPDF:
https://github.com/jbarlow83/OCRmyPDF
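A typical invocation might look like the sketch below. The ocrmypdf command takes an input file and an output file; -l selects the Tesseract language and --skip-text leaves pages that already contain text alone. The wrapper function, its name, and the -ocr.pdf suffix are illustrative choices, not something OCRmyPDF prescribes.

```shell
#!/bin/sh
# Sketch only: add a text layer with OCRmyPDF, writing to a new file.

# Hypothetical helper: derive an output name, e.g. scan.pdf -> scan-ocr.pdf.
ocr_output_name() {
    printf '%s-ocr.pdf\n' "${1%.*}"
}

# Hypothetical wrapper. $1: input PDF; $2: Tesseract language code
# (defaults to eng; the matching Tesseract language data must be installed).
ocr_with_ocrmypdf() {
    ocrmypdf -l "${2:-eng}" --skip-text "$1" "$(ocr_output_name "$1")"
}
```

For example, ocr_with_ocrmypdf scanned.pdf spa would write scanned-ocr.pdf using the Spanish language data.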
I have had excellent results with your script. Unlike pdfocr by Geza Kovacs, it does not require any extra (hard to compile in some Linux distros!) libraries. Thank you!
– Maxim
May 3 '18 at 15:04
This is my quick and dirty solution based on ImageMagick's convert, tesseract, parallel and pdftk (all available on Debian-based distributions). It's largely based on this blog post.
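#!/bin/sh -ex
# Arguments: $1 = input PDF, $2 = optional scan resolution in DPI
density=${2:-"300"} # default to 300 DPI if 2nd parameter is not given
# Rasterize every page into a monochrome, LZW-compressed TIFF
convert -monitor -density "$density" "$1" -monochrome -compress lzw -alpha deactivate page_%05d.tif
# OCR the pages in parallel; tesseract writes one searchable PDF per page
parallel --bar "tesseract {} {.} pdf 2>/dev/null" ::: page_*.tif
# Merge the per-page PDFs into <input>-ocred.pdf
pdftk page_*.pdf cat output "${1%.*}-ocred.pdf" compress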
# Cleanup temp files
rm page_?????.tif page_?????.pdf
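Because the script relies on four external tools, it may help to check for them up front rather than failing halfway through. A small sketch; missing_tools and require_ocr_tools are hypothetical helper names, and the tool list simply mirrors the commands used above:

```shell
#!/bin/sh
# Sketch only: report which required commands are not on the PATH.

# Hypothetical helper: print each command name from "$@" that cannot be found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: fail with a message if any tool used by the script is missing.
require_ocr_tools() {
    missing=$(missing_tools convert tesseract parallel pdftk)
    if [ -n "$missing" ]; then
        # $missing is left unquoted on purpose so the names print on one line.
        echo "missing tools:" $missing >&2
        return 1
    fi
}
```

Calling require_ocr_tools near the top of the script would stop it before any temporary files are created.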
pdfsandwich answer: answered Jul 25 '14 at 13:27 by Tobias Elze, edited Mar 10 '17 at 4:03 by Pablo Bianchi
GScan2PDF and OCRFeeder answer: answered Jun 7 '12 at 21:24 by Aldi, edited Feb 19 '13 at 10:02 by Ashwin Nanjappa
PDF X-Change Viewer answer: answered Feb 19 '13 at 10:31 by To Do
pdfocr answer: answered Mar 23 '14 at 20:23 by Robert Citek
OCRmyPDF answer: answered Nov 8 '17 at 16:47 by user127022
convert/tesseract/pdftk script answer: answered 7 hours ago by stefanct