<!DOCTYPE html>
<html lang="en">
<head>
    
    <link rel="stylesheet" href="https://unpkg.com/latex.css/style.min.css" />
    <link rel="stylesheet" href="/assets/main.css" />
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>A new method to blog</title>
    <meta name="og:site_name" content="Navan Chauhan" />
    <link rel="canonical" href="https://web.navan.dev/posts/2022-11-07-a-new-method-to-blog.html" />
    <meta name="twitter:url" content="https://web.navan.dev/posts/2022-11-07-a-new-method-to-blog.html" />
    <meta name="og:url" content="https://web.navan.dev/posts/2022-11-07-a-new-method-to-blog.html" />
    <meta name="twitter:title" content="A new method to blog" />
    <meta name="og:title" content="A new method to blog" />
    <meta name="description" content="Writing posts in markdown using pen and paper" />
    <meta name="twitter:description" content="Writing posts in markdown using pen and paper" />
    <meta name="og:description" content="Writing posts in markdown using pen and paper" />
    <meta name="twitter:card" content="summary_large_image" />
    <link rel="shortcut icon" href="/images/favicon.png" type="image/png" />
    <link rel="alternate" href="/feed.rss" type="application/rss+xml" title="Subscribe to Navan Chauhan" />
    <meta name="twitter:image" content="https://web.navan.dev/images/opengraph/posts/2022-11-07-a-new-method-to-blog.png" />
    <meta name="og:image" content="https://web.navan.dev/images/opengraph/posts/2022-11-07-a-new-method-to-blog.png" />
    <meta name="google-site-verification" content="LVeSZxz-QskhbEjHxOi7-BM5dDxTg53x2TwrjFxfL0k" />
    <script data-goatcounter="https://navanchauhan.goatcounter.com/count"
        async src="//gc.zgo.at/count.js"></script>
    <script defer data-domain="web.navan.dev" src="https://plausible.io/js/plausible.js"></script>
    <link rel="manifest" href="manifest.json" />
    
</head>
<body>
    <center><nav style="display: block;">
|
<a href="/">home</a> |
<a href="/about/">about/links</a> |
<a href="/posts/">posts</a> |
<!--<a href="/publications/">publications</a> |-->
<!--<a href="/repo/">iOS repo</a> |-->
<a href="/feed.rss">RSS Feed</a> |
</nav>
</center>
    
<main>

	<h1>A new method to blog</h1>

<p><em><a rel="noopener" target="_blank" href="/assets/pdfs/2022-11-07-a-new-way-to-blog.pdf">Here</a> is the original PDF. I made some edits to the content after generating the Markdown file.</em></p>

<p><a rel="noopener" target="_blank" href="https://paperwebsite.com">Paper Website</a> is a service that lets you build a website with just pen and paper. I am going to try and replicate the process.</p>

<h2>The Plan</h2>

<p>The Continuity feature on macOS and iOS lets you scan pages into a PDF directly from your iPhone. I want to be able to scan these pages and automatically run an Automator script that takes the PDF and OCRs the text. Then I can clean up the text further and convert it from Markdown.</p>

<h2>Challenges</h2>

<p>I quickly realised that the OCR software I planned on using could not accurately detect my shitty handwriting. I tried ABBYY FineReader, Prizmo and OCRmyPDF. (ABBYY FineReader and Prizmo can both be automated with Automator.)</p>

<p>Now, I could either write more neatly or use an external API like Microsoft Azure.</p>

<h2>Solution</h2>

<h3>OCR</h3>

<p>In the PDFs, each scan is saved as an image on its own page. I extract each image and then send it to Azure's API.</p>
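
<p>As a rough illustration of the "send it to Azure" step, here is a minimal sketch using only the Python standard library. It assumes the Azure Computer Vision Read REST API (version 3.2), which may not be the exact service or version used for this post. The function names <code>build_read_request</code> and <code>extract_lines</code> are hypothetical, and extracting the image from the PDF is left out.</p>

```python
import json
import urllib.request

# Assumption: the Azure Computer Vision "Read" REST API, version 3.2.
# The post only says "Azure's API", so treat this path as illustrative.
READ_PATH = "/vision/v3.2/read/analyze"


def build_read_request(endpoint: str, key: str, image_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) the POST request that submits one image for OCR."""
    return urllib.request.Request(
        url=endpoint.rstrip("/") + READ_PATH,
        data=image_bytes,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,  # key of your own Azure resource
            "Content-Type": "application/octet-stream",  # raw image bytes in the body
        },
    )


def extract_lines(result: dict) -> list[str]:
    """Collect the recognised lines from a finished Read result, page by page."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]
```

<p>Sending the request with <code>urllib.request.urlopen</code> should return a response whose <code>Operation-Location</code> header points at the result URL. That URL is then polled with a GET request (using the same key header) until the JSON reports a status of <code>succeeded</code>, at which point it can be decoded with <code>json.loads</code> and passed to <code>extract_lines</code>.</p>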

<h3>Paragraph Breaks</h3>

<p>The recognised text had many line breaks in the middle of sentences. Therefore, I use what is called a <a rel="noopener" target="_blank" href="https://en.wikipedia.org/wiki/Pilcrow">pilcrow</a> to mark paragraph breaks. But rather than trying to draw the actual pilcrow, I just write the HTML entity <code>&amp;#182;</code>, which represents the pilcrow character.</p>
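
<p>The paragraph handling can be sketched roughly like this. It assumes the OCR returns the handwritten marker as the literal entity text and that the result arrives as a list of lines. The function name <code>lines_to_paragraphs</code> is hypothetical and is not taken from the linked Gist.</p>

```python
# Assumption: the OCR returns the handwritten marker as this literal text.
MARKER = "&#182;"


def lines_to_paragraphs(lines: list[str], marker: str = MARKER) -> list[str]:
    """Join OCR lines into one string, then split it into paragraphs at each marker."""
    text = " ".join(line.strip() for line in lines)
    # Splitting works whether the marker is written before or after a paragraph;
    # empty pieces are dropped and runs of whitespace are collapsed.
    return [" ".join(part.split()) for part in text.split(marker) if part.strip()]


if __name__ == "__main__":
    scanned = [
        "&#182; Paper Website lets you build",
        "a website with pen and paper.",
        "&#182; I am going to try and",
        "replicate the process.",
    ]
    for paragraph in lines_to_paragraphs(scanned):
        print(paragraph)
```

<p>This simple version does not rejoin words that were hyphenated across a line break, so those would still need manual cleanup.</p>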

<h2>Where is the code?</h2>

<p>I created a <a rel="noopener" target="_blank" href="https://gist.github.com/navanchauhan/5fc602b1e023b60a66bc63bd4eecd4f8">GitHub Gist</a> with a sample Python script that takes the PDF and prints the text.</p>

<p>A more complete version with Automator scripts and an entire publishing pipeline will be available as a GitHub and Gitea repo soon.</p>

<p><em>* In Part 2, I will discuss some more features *</em> </p>

	<blockquote>If you have scrolled this far, consider subscribing to my mailing list <a href="https://listmonk.navan.dev/subscription/form">here.</a> You can subscribe to either a specific type of post you are interested in, or subscribe to everything with the "Everything" list.</blockquote>
	<script data-isso="https://comments.navan.dev/"
        src="https://comments.navan.dev/js/embed.min.js"></script>
	<section id="isso-thread">
	    <noscript>Javascript needs to be activated to view comments.</noscript>
	</section>
</main>

    <script src="assets/manup.min.js"></script>
    <script src="/pwabuilder-sw-register.js"></script>    
</body>
</html>