The tesseract package provides R bindings to Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.
Keep in mind that OCR (and pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect, and accuracy drops rapidly as the quality of the input image decreases. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image.
OCR is the process of finding and recognizing text inside images, for example from a screenshot or a scanned paper. The image below has some example text:
The ocr() function extracts text from an image file. After indicating the engine for the language, it will return the text found in the image:
library(cpp11tesseract)
file <- system.file("examples", "testocr.png", package = "cpp11tesseract")
eng <- tesseract("eng")
text <- ocr(file, engine = eng)
cat(text)
## This is a lot of 12 point text to test the
## ocr code and see if it works on all types
## of file format.
##
## The quick brown dog jumped over the
## lazy fox. The quick brown dog jumped
## over the lazy fox. The quick brown dog
## jumped over the lazy fox. The quick
## brown dog jumped over the lazy fox.
The ocr_data() function returns all words in the image along with a bounding box and confidence rate.
results <- ocr_data(file, engine = eng)
results
## # A tibble: 60 × 4
## word confidence bbox stringsAsFactors
## <chr> <dbl> <chr> <lgl>
## 1 This 96.8 36,92,96,116 FALSE
## 2 is 96.9 109,92,129,116 FALSE
## 3 a 95.0 141,98,156,116 FALSE
## 4 lot 95.0 169,92,201,116 FALSE
## 5 of 96.4 212,92,240,116 FALSE
## 6 12 96.4 251,92,282,116 FALSE
## 7 point 96.3 296,92,364,122 FALSE
## 8 text 96.2 374,93,427,116 FALSE
## 9 to 97.0 437,93,463,116 FALSE
## 10 test 97.0 474,93,526,116 FALSE
## # ℹ 50 more rows
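The bbox column is stored as text, so it usually needs to be split before it can be used for filtering or plotting. The sketch below does this in base R on two rows typed in by hand from the output above. It assumes the four numbers are the left, top, right and bottom edges in pixels; the column names x1, y1, x2 and y2 are my own labels, not something the package returns.

```r
# Two rows copied by hand from the ocr_data() output above
words <- data.frame(
  word = c("This", "is"),
  confidence = c(96.8, 96.9),
  bbox = c("36,92,96,116", "109,92,129,116"),
  stringsAsFactors = FALSE
)

# Split each "a,b,c,d" string into four integers, one row per word
coords <- do.call(
  rbind,
  lapply(strsplit(words$bbox, ",", fixed = TRUE), as.integer)
)

# Assumed meaning of the four numbers: left, top, right, bottom
colnames(coords) <- c("x1", "y1", "x2", "y2")

# Attach the numeric coordinates and derive the box size in pixels
words <- cbind(words, coords)
words$width <- words$x2 - words$x1
words$height <- words$y2 - words$y1

print(words[, c("word", "confidence", "width", "height")])
```

With numeric coordinates in place, low-confidence or unusually small boxes can be dropped with ordinary data frame subsetting.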
The tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms bias towards words and sentences that frequently appear together in a given language, just like the human brain does. Therefore the most accurate results will be obtained when using training data in the correct language.
Use tesseract_info() to list the languages that you currently have installed.
tesseract_info()
## $datapath
## [1] "/usr/share/tesseract-ocr/5/tessdata/"
##
## $available
## [1] "eng" "osd"
##
## $version
## [1] "5.3.4"
##
## $configs
## [1] "alto" "ambigs.train" "api_config" "bigram"
## [5] "box.train" "box.train.stderr" "digits" "get.images"
## [9] "hocr" "inter" "kannada" "linebox"
## [13] "logfile" "lstm.train" "lstmbox" "lstmdebug"
## [17] "makebox" "pdf" "quiet" "rebox"
## [21] "strokewidth" "tsv" "txt" "unlv"
## [25] "wordstrbox"
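Before creating an engine for a language it can be handy to check whether its training data is installed. The sketch below works on a hand-typed copy of the relevant fields of the output above, so that it runs without Tesseract; missing_languages() is a hypothetical helper, not part of the package.

```r
# Relevant fields of the tesseract_info() output above, typed in by hand
info <- list(
  datapath = "/usr/share/tesseract-ocr/5/tessdata/",
  available = c("eng", "osd"),
  version = "5.3.4"
)

# Hypothetical helper: which of the requested languages are not installed?
missing_languages <- function(info, wanted) {
  setdiff(wanted, info$available)
}

todo <- missing_languages(info, c("eng", "chi_sim"))
print(todo)
```

In a live session, info <- tesseract_info() would replace the hand-typed list, and the languages in todo are the ones that still need to be installed.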
By default the R package only includes English training data. Windows and Mac users can install additional training data using tesseract_download(). Let’s OCR a screenshot from Wikipedia in Simplified Chinese.
# Now load the dictionary
(simplified_chinese <- tesseract("chi_sim"))
file <- system.file("examples", "chinese.jpg", package = "cpp11tesseract")
text <- ocr(file, engine = simplified_chinese)
cat(text)
Compare with the text copied and pasted from Wikipedia:
## 奧林匹克運動會(希臘語:Ολυμπιακοί Αγώνες;法語 Jeux olympiques;英語:Olympic Games)簡稱奧運會、奧運,是世界最高等級的國際綜合體育賽事,由國際 奧林匹克委員會主辦,每4年舉行一次。冬季競技項目創立冬季奧林匹克運動會後,之前 的奧林匹克運動會則是又稱為「夏季奧林匹克運動會」以示區分。從1994年起,冬季奧 運會和夏季奧運會分開,相隔2年交替舉行。奥林匹克運動會最早起源於古希腊,是當時 各城邦之間的公開較量,因為舉辦地在奧林匹亚而得名。信奉基督教的羅馬皇帝狄奧多西 一世以奧林匹克運動會崇拜耶穌以外神衹為由,禁止奧運競技,於是奧運在舉辦超過 1,000年後於4世紀末停辦,奧運這次停辦持續了1,503年,直到19世纪末才由後人發現 遺蹟。之後,法國的顾拜旦男爵皮耶·德·古柏坦創立了有真正奧運精神的現代奧林匹克運 動會,自1896年開始每4年舉辦一次,更確立了會期不超過18日的傳統。現代奧運會只 在兩次世界大戰期間合共中斷過5次(分別是1916年夏季奧運會、1940年夏季奧運會 [1]、1940年冬季奧運會[1]、1944年夏季奧運會和1944年冬季奧運會)[註 1],以及在 2020年因全球防疫延期過一次(2020年夏季奧運會[2][註 2])。
The accuracy of the OCR process depends on the quality of the input image. You can often improve results by properly scaling the image, removing noise and artifacts or cropping the area where the text exists. See the "Improve quality" page of the Tesseract wiki for important tips to improve the quality of your input image.
The awesome magick R package has many useful functions that can be used for enhancing the quality of the image. Some things to try:
- image_deskew() and image_rotate() make the text horizontal.
- image_trim() crops out whitespace in the margins. Increase the fuzz parameter to make it work for noisy whitespace.
- image_convert() turns the image into greyscale, which can reduce artifacts and enhance actual text.
- image_resize() can help tesseract determine text size.
- image_modulate() or image_contrast() tweak brightness / contrast if this is an issue.
- image_reducenoise() performs automated noise removal. Your mileage may vary.
- image_quantize() reduces the number of colors in the image. This can sometimes help with increasing contrast and reducing artifacts.
- image_convolve() applies custom convolution methods.

Below is an example OCR scan. The code converts it to grayscale and resizes + crops the image before feeding it to tesseract to get more accurate OCR results.
library(magick)
## Linking to ImageMagick 6.9.12.98
## Enabled features: fontconfig, freetype, fftw, heic, lcms, pango, raw, webp, x11
## Disabled features: cairo, ghostscript, rsvg
## Using 4 threads
file <- system.file("examples", "wilde.jpg", package = "cpp11tesseract")
input <- image_read(file)
text <- input %>%
  image_resize("2000x") %>%
  image_convert(type = "Grayscale") %>%
  image_trim(fuzz = 40) %>%
  image_write(format = "png", density = "300x300") %>%
  ocr()
cat(text)
## Act One
##
## [The living room of Algernon Moncrieff's flat in Mayfair, London.
## Lane is arranging afternoon tea on a table. Algernon enters]
## Algernon: Lane, have you made the cucumber sandwiches for
## Lady Bracknell’s tea?
##
## Lane: Yes, sir. [Handing them to Algernon on a silver tray]
## Algernon: [Looking carefully at them, taking two and sitting down
## on the sofa] Oh, by the way", Lane, I looked at your notebook. |
## toticed that when Lord Shoreman and Mr Worthing dined with
## me on Thursday night, eight bottles of champagne were drunk,
## Lane: Yes, sir; eight bottles.
##
## Algernon: Why is it that, in a bachelor’s home, the servants
## always drink the champagne? | just ask because I am interested,
## Lane.
##
## Lane: | think that it is because the champagne is better in a
## lachelor’s home. | have noticed that the champagne in married
## people’s homes is rarely very good.
##
## Algernon: Good heavens’! Is marriage so depressing?
##
## Lane: | believe marriage is very pleasant, sir. | haven't had much
## experience of it myself. I have only been married once, and that
## was because of a misunderstanding*® between myself and a young
## person.
##
## Algernon: [Lazily, without interest] 1 am not very interested in
## your family life, Lane.
##
## Lane: No, sir; it is not a very interesting subject. | never think
## of it myself.
##
## Algernon: That is very understandable. Well, thank you, Lane.
## [Lane goes off]
##
## Algernon: [To himself] Lane’s views on marriage seem very casual.
## Really, if the servants don’t set us a good example, what on earth
## is the use of them? They seem to have no morals",
## {Lane enters]
##
## Lane: Mr Ernest Worthing is here, sir.
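Even after preprocessing, the output above repeatedly shows a standalone "|" where the play has the pronoun "I". A small clean-up pass can repair this kind of systematic confusion. The sketch below is a heuristic of my own, not a feature of Tesseract or of the package, and fix_pipe_as_I() is a hypothetical name; it would be wrong for documents that legitimately contain standalone vertical bars.

```r
# Hypothetical helper: replace a vertical bar that stands alone as a word
# (surrounded by whitespace or at the start/end of the string) with "I"
fix_pipe_as_I <- function(x) {
  gsub("(^|\\s)\\|(?=\\s|$)", "\\1I", x, perl = TRUE)
}

# Lines copied from the OCR output above
lines <- c(
  "Lane: | think that it is because the champagne is better in a",
  "always drink the champagne? | just ask because I am interested,",
  "{Lane enters]"
)

cleaned <- fix_pipe_as_I(lines)
print(cleaned)
```

The third line is left unchanged because it contains no standalone vertical bar.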
If your images are stored in PDF files, they first need to be converted to a proper image format. We can do this in R using the pdf_convert() function from the pdftools package. Use a high DPI to preserve the quality of the image.
file <- system.file("examples", "ocrscan.pdf", package = "cpp11tesseract")
pngfile <- pdftools::pdf_convert(file, dpi = 600)
## Converting page 1 to ocrscan_1.png... done!
text <- ocr(pngfile)
cat(text)
## | SAPORS LANE - BOOLE - DORSET - BH 25 8 ER
## TELEPHONE BOOLE (945 13) 51617 - TELEX 123456
##
## Our Ref. 350/PJC/EAC 18th January, 1972.
## Dr. P.N. Cundall,
## Mining Surveys Ltd.,
## Holroyd Road,
## Reading,
## Berks.
## Dear Pete,
##
## Permit me to introduce you to the facility of facsimile
## transmission.
##
## In facsimile a photocell is caused to perform a raster scan over
##
## the subject copy. The variations of print density on the document
## cause the photocell to generate an analogous electrical video signal.
## This signal is used to modulate a carrier, which is transmitted to a
## remote destination over a radio or cable communications link.
##
## At the remote terminal, demodulation reconstructs the video
## signal, which is used to modulate the density of print produced by a
## printing device. This device is scanning in a raster scan synchronised
## with that at the transmitting terminal. As a result, a facsimile
## copy of the subject document is produced.
##
## Probably you have uses for this facility in your organisation.
##
## Yours sincerely,
## Ay, f
## P.J. CROSS
## Group Leader - Facsimile Research
## Registered in England: No. 2038
## No. 1 Registered Office: GO Vicara Lane, Ilford. Essex.
Tesseract supports hundreds of “control parameters” which alter the OCR engine. Use tesseract_params() to list all parameters with their default value and a brief description. It also has a handy filter argument to quickly find parameters that match a particular string.
tesseract_params("colour")
## # A tibble: 2 × 3
## param default desc
## * <chr> <chr> <chr>
## 1 editor_image_word_bb_color 7 Word bounding box colour
## 2 editor_image_blob_bb_color 4 Blob bounding box colour
Do note that some of the control parameters have changed between Tesseract engine 3 and 4.
tesseract_info()["version"]
## $version
## [1] "5.3.4"
One powerful parameter is tessedit_char_whitelist, which restricts the output to a limited set of characters. This may be useful for reading numbers, for example a bank account, zip code, or gas meter.
The whitelist parameter works for all versions of Tesseract engine 3 and also engine versions 4.1 and higher, but unfortunately it did not work in Tesseract 4.0.
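The version rule above can be turned into a small check, which is useful in scripts that must run against different Tesseract installations. This is a sketch in base R; whitelist_supported() is a hypothetical helper, and it assumes the version is a plain dotted string such as the "5.3.4" reported earlier (pre-release suffixes would need extra handling).

```r
# Hypothetical helper encoding the rule stated above:
# engine 3.x and engine 4.1 or higher support the whitelist, 4.0.x does not
whitelist_supported <- function(version) {
  v <- numeric_version(version)
  v < "4.0" || v >= "4.1"
}

print(whitelist_supported("5.3.4"))  # the version reported earlier
print(whitelist_supported("4.0.0"))
```

In a live session the argument would come from tesseract_info()$version.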
file <- system.file("examples", "receipt.jpg", package = "cpp11tesseract")
numbers <- tesseract(options = list(tessedit_char_whitelist = "-$.0123456789"))
cat(ocr(file, engine = numbers))
## 0
##
## 00068354712539
##
## 01.8$31.998
## 25 -$8.00
##
## 00084019961505
##
## 03966$44.99
##
## 00003558543582
##
## 8 $8.93
##
## $
##
## 00000002000414
##
## $0.50
##
## $$60$10 -$10.00
##
## $ $68.47
##
## $8.84
##
## $77.31
To test if this actually works, look at the output without the whitelist:
cat(ocr(file, engine = eng))
## DOG
##
## 000683547 12539
##
## OPEN FARM DOG AG SALMON 1.8KG $31.99 HST
## Item discount 25% -$8.00 HST
##
## 00084019961505
##
## VE FO GOOG BF NIB 396G LRG KONG $44.99 HST
##
## ACCESSORIES
##
## 00003558543582
##
## KONG BRUSH $8.93 HST
##
## STORE USE ITEMS
##
## 000000020004 14
##
## GPF CLOTH BAG LARGE $0.50
##
## FPS SPEND $60 SAVE $10 -$10.00
##
## SUB TOTAL $68.47
##
## HST $8.84
##
## TOTAL $77.31
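The whitelisted output still contains stray digits and dollar signs, so in practice it tends to need a second pass. The sketch below pulls well-formed dollar amounts out of a few lines copied from the whitelisted output above, using base R regular expressions. The pattern is an assumption about what a price looks like on this receipt (optional minus sign, dollar sign, digits, and exactly two decimals), not something Tesseract provides.

```r
# A few lines copied from the whitelisted output above
lines <- c("01.8$31.998", "25 -$8.00", "$$60$10 -$10.00", "$ $68.47", "$77.31")

# Optional minus sign, dollar sign, digits, a dot, and two decimals
pattern <- "-?\\$[0-9]+\\.[0-9]{2}"
amounts <- unlist(regmatches(lines, gregexpr(pattern, lines)))

# Drop the dollar sign and convert to numbers
values <- as.numeric(sub("$", "", amounts, fixed = TRUE))
print(values)
```

Fragments such as "$60" and "$10" are skipped because they have no decimal part, which matches the "SPEND $60 SAVE $10" text visible in the unrestricted output.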
As an Easter egg, this is Mr. Duke:
Here is the extracted text:
file <- system.file("examples", "mrduke.jpg", package = "cpp11tesseract")
text <- ocr(file, engine = eng)
cat(text)
## ee
## Wear. See
## yor Cee 2 ee ee
## ys uae
## ot
## a Od —
## teeta We
## an eee
## oe e
## — Nii a
## = ¢ ae
## a. ae es
## Ze. <n BR ee
## ee Ih Rae
## eee ee
## Mr. Duke, 4 years old (2024) 2
## sea ee a Bass
To improve the OCR results, Tesseract offers two variants of its models. The tesseract_download() function can download the ‘best’ (but slower) model, which increases the accuracy. The ‘fast’ (but less accurate) model is the default.
file <- system.file("examples", "chinese.jpg", package = "cpp11tesseract")
# download the best model (vertical script download is to avoid a warning)
dir <- tempdir()
tesseract_download("chi_sim_vert", datapath = dir, model = "best")
tesseract_download("chi_sim", datapath = dir, model = "best")
# compare the results: fast (text1) vs best (text2)
text1 <- ocr(file, engine = tesseract("chi_sim"))
text2 <- ocr(file, engine = tesseract("chi_sim", datapath = dir))
cat(text1)
cat(text2)
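Printing both results side by side makes differences hard to judge, especially for long texts. One option is to score each output against a hand-typed reference with a character error rate based on edit distance. The sketch below uses adist() from base R; char_error_rate() is a hypothetical helper, and the strings are a short illustration adapted from the Wilde scan earlier in this document rather than the Chinese example, with the reference assumed to be the correct text.

```r
# Hypothetical helper: edit distance divided by the length of the reference
char_error_rate <- function(ocr_text, reference) {
  # adist() computes the generalized Levenshtein distance
  as.numeric(utils::adist(ocr_text, reference)) / nchar(reference)
}

reference <- "I noticed that"  # assumed ground truth, typed by hand
candidate <- "| toticed that"  # adapted from the OCR output of the Wilde scan

rate <- char_error_rate(candidate, reference)
print(rate)
```

Applied to text1 and text2 with the same reference, the lower rate indicates the more accurate model for that image.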
The tesseract_contributed_download() function can download contributed models. For example, the grc_hist model is useful for Polytonic Greek. Here is an example from Sophocles’ Ajax (source: Ajax Multi-Commentary):
file <- system.file("examples", "polytonicgreek.png", package = "cpp11tesseract")
# download the best models
dir <- tempdir()
tesseract_download("grc", datapath = dir, model = "best")
tesseract_contributed_download("grc_hist", datapath = dir, model = "best")
# compare the results: grc (text1) vs grc_hist (text2)
text1 <- ocr(file, engine = tesseract("grc", datapath = dir))
text2 <- ocr(file, engine = tesseract("grc_hist", datapath = dir))
cat(text1)
cat(text2)
Note: Amazon and Textract are trademarks of Amazon.com, Inc.
Textract documentation uses page three of the January 1966 report from Philadelphia Fed’s Tealbook (formerly Greenbook).
Here is the first element of the list returned by Textract:
# List of 13
# $ BlockType : chr "TABLE"
# $ Confidence : num 100
# $ Text : chr(0)
# $ RowIndex : int(0)
# $ ColumnIndex : int(0)
# $ RowSpan : int(0)
# $ ColumnSpan : int(0)
# $ Geometry :List of 2
# .. <not shown>
# $ Id : chr "c6841638-d3e0-414b-af12-b94ed34aac8a"
# $ Relationships :List of 1
# ..$ :List of 2
# .. ..$ Type: chr "CHILD"
# .. ..$ Ids : chr [1:256] "e1866e80-0ef0-4bdd-a6fd-9508bb833c03" ...
# $ EntityTypes : list()
# $ SelectionStatus: chr(0)
# $ Page : int 3
Here is Tesseract’s output:
file <- system.file("examples", "tealbook.png", package = "cpp11tesseract")
text <- ocr(file)
cat(text)
## Nemes mm a a ee en e-em n an ae ee
## Year SSC—~SSESSC~*«C
## 1965 IV I
## Esti- Esti- Pro-
## 1964 __mated yi/ rr/ rrr! mated _ jected
## Gross National Product 628.7 675.7 657.6 668.8 681.5 695.0 707.0
## Personal consumption expenditures 398.9 428.6 416.9 424.5 432.5 440.5 447.1
## Durable goods 58.7 65.0 64.6 63.5 65.4 66.4 66.6
## Nondurable goods 177.5 188.8 182.8 187.9 190.5 194.0 197.6
## Services 162.6 174.9 169.5 173,1 176.7 180.1 182.9
## Gross private domestic investment 92.9 104.9 103.4 102.8 106.2 107.0 109.1
## Residential construction 27.5 27.7 27.7 28.0 27.7 27.3 27.5
## Business fixed investment 60.5 69.8 66.9 68.4 70.9 73.1 75.1
## Change in business inventories 4.8 7.4 8.8 6.4 7.6 6.6 6.5
## Nonfarm 5.4 7.1 9.2 6.6 7.0 5.4 5.5
## Net exports 8.6 7.3 6.0 8.0 7.4 7.8 8.1
## Gov. purchases of goods & services 128.4 135,0 131.3 133.5 135.4 139.7 142.7
## Federal 65.3 66.6 64.9 65.7 66.5 69.4 70.7
## Defense 49.9 49.9 48.8 49.2 49.8 51.8 52.7
## Other 15.4 16.7 16.1 16.5 16.7 17.6 18.0
## State and local 63.1 68,4 66.4 67.8 68.9 70.3 72.0
## Gross National Product in Constant 577.6 609.3 597.7 603.5 613.0 622.4 630.1
## (1958) Dollars
## Personal income 495.0 530.5 516.2 524.7 536.0 544.9 552.0
## Wages and salaries 333.5 357.3 348.9 353.6 359.0 367.5 374.1
## Farm income 12.0 14.2 12.0 14.5 15.0 15.3 15.3
## Personal contributions for
## social insurance (deduction) 12.4 13.2 12.9 13.0 13.3 13.6 16.6
## Disposable personal income 435.8 465.0 451.4 458.5 471.2 478.7 485.1
## Personal saving 26.3 24.6 23.3 22.4 26.8 26.0 25.5
## Saving rate (per cent) 6.0 5.3 5.2 4.9 5.7 5.4 5.3
## Total labor force (millions) 77.0 78.3 77.7. 78.2 78.5 78.9 79.6
## Armed forces " 2.7 2.7 2.7 2.7 2.7 2.8 2.9
## Civilian labor force " 74.2 75.6 75.0 75.5 75.8 76,1 76,7
## Employed " 70.4 72.1 71.3 71.9 72.4 72.9 73.6
## Unemployed " 3.9 3.5 3.6 3.6 3.4 3.2 3.1
## Unemployment rate (per cent) 5.2 4.6 4.8 4.7 4.4 4.2 4.0
One way to organize the output is to split the text before the first digit on each line.
text <- strsplit(text, "\n")[[1]]
text <- text[6:length(text)]
for (i in seq_along(text)) {
  firstdigit <- regexpr("[0-9]", text[i])[1]
  variable <- trimws(substr(text[i], 1, firstdigit - 1))
  values <- strsplit(substr(text[i], firstdigit, nchar(text[i])), " ")[[1]]
  values <- trimws(gsub(",", ".", values))
  values <- suppressWarnings(as.numeric(gsub("\\.$", "", values)))
  if (length(values[!is.na(values)]) < 1) {
    next
  }
  res <- c(variable, values)
  names(res) <- c(
    "variable", "y1964", "y1965est", "y1965q1",
    "y1965q2", "y1965q3", "y1965q4est", "y1966q1pro"
  )
  if (i == 1) {
    df <- as.data.frame(t(res))
  } else {
    df <- rbind(df, as.data.frame(t(res)))
  }
}
head(df)
## variable y1964 y1965est y1965q1 y1965q2 y1965q3
## 1 Gross National Product 628.7 675.7 657.6 668.8 681.5
## 2 Personal consumption expenditures 398.9 428.6 416.9 424.5 432.5
## 3 Durable goods 58.7 65 64.6 63.5 65.4
## 4 Nondurable goods 177.5 188.8 182.8 187.9 190.5
## 5 Services 162.6 174.9 169.5 173.1 176.7
## 6 Gross private domestic investment 92.9 104.9 103.4 102.8 106.2
## y1965q4est y1966q1pro
## 1 695 707
## 2 440.5 447.1
## 3 66.4 66.6
## 4 194 197.6
## 5 180.1 182.9
## 6 107 109.1
The result is not perfect (e.g. I still need to change “Gross National Product in Constant” to add the “(1958) Dollars”), but neither is Textract’s, and it requires writing a more complex loop to organize the data. Certainly, this can be simplified by using the Tidyverse.
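Short of moving to the Tidyverse, the same idea can be written in base R without growing a data frame inside a loop. The sketch below is an alternative under my own assumptions, not the method used above: it extracts every decimal number on a line with a regular expression and keeps only lines that yield exactly seven values. parse_line() is a hypothetical helper, and the three input lines are copied from the OCR output above.

```r
# Three lines copied from the OCR output above
lines <- c(
  "Gross National Product 628.7 675.7 657.6 668.8 681.5 695.0 707.0",
  "Services 162.6 174.9 169.5 173,1 176.7 180.1 182.9",
  "Personal contributions for"
)

cols <- c(
  "y1964", "y1965est", "y1965q1",
  "y1965q2", "y1965q3", "y1965q4est", "y1966q1pro"
)

# Hypothetical helper: one line of text in, a one-row data frame (or NULL) out
parse_line <- function(line) {
  # Decimal numbers, accepting "," where the OCR misread the decimal point
  tokens <- regmatches(line, gregexpr("[0-9]+[.,][0-9]+", line))[[1]]
  if (length(tokens) != length(cols)) {
    return(NULL)  # not a complete data row
  }
  label <- trimws(sub("[0-9].*$", "", line))
  values <- as.numeric(sub(",", ".", tokens, fixed = TRUE))
  data.frame(
    variable = label,
    t(setNames(values, cols)),
    stringsAsFactors = FALSE
  )
}

rows <- Filter(Negate(is.null), lapply(lines, parse_line))
df2 <- do.call(rbind, rows)
print(df2)
```

Unlike the loop above, this version stores the values as numbers rather than text, and it silently drops any line where the OCR lost or merged a column, so the number of rows should be checked against the source table.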