<!DOCTYPE html>
<head>
<title>AS4 | Evan Hoose</title>
<style>
body {
	font-family: "Lucida Console", Monaco, monospace;
	padding: 0px 10% 0px;
	/*background-color:  #001214;*/
	background-color: #282828;
}
p,ol,ul {
	/*color: #93a1a1;*/
	color: #ebdbb2;
}

.tab-bar-hr {
	background-color: #504945;
	color: #504945;
}

.active {
	/*background-color: #484848;*/
	background-color: #504945;
	padding: 9px 9px 9px;
}

.inactive {
	padding: 9px 9px 9px;
}
.inactive:hover {
	background-color: #504945;
}
h1 {
	/*color: #839496;*/
	color: #d79921;
}
h2,h3,h4,h5,h6 {
	/*color: #5f5faf;*/
	color: #d3869b
}
a {
	/*color: #2aa198;*/
	color: #83a598;
}
hr {
	/*color: #93a1a1;*/
	color: #ebdbb2;
}

code {
	padding: 5px 5px 5px;
	color: #a89984;
	background-color: #32302f;
}

.footer {
	color: #504945;
}
</style>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body>
<div id="header">
        <h1>A Shared 404 |</h1>
        <p>Evan Hoose's Website, Blog and stomping ground</p>
        <a href="/" class="inactive">Home</a>
        <a href="/hire-me" class="inactive">Hire Me</a>
        <a href="/blog" class="active">Blog</a>
	<a href="/programs" class="inactive">Programs</a>
        <a href="/tutorials" class="inactive">Tutorials</a>
        <a href="/other-stuff" class="inactive">Other Stuff</a>
        <hr class="tab-bar-hr">
</div>
<h1>A better search solution.</h1>

<p>Drew DeVault wrote <a href="https://drewdevault.com/2020/11/17/Better-than-DuckDuckGo.html">this</a>.
(Read that first. It'll provide useful context I won't explain.)</p>

<p>I like Drew's work, and this got me thinking.</p>

<p>What is the best way to implement a search engine in this style?</p>

<hr />

<h2>Before you begin:</h2>

<p>This is a living document. I will make edits and other changes without warning.</p>

<p>I am mostly using this as notekeeping for myself, and have made it publicly
available in the hopes that it will be useful.</p>

<p>I do intend to keep playing around with this, but it is strictly on a spare
time/I feel like it basis.</p>

<h2>Resources for learning:</h2>

<p>As I've started studying this, I've come across some resources. I'll link them
here in case anyone else is looking.</p>

<p><a href="https://nlp.stanford.edu/IR-book/information-retrieval-book.html">Introduction to Information Retrieval</a>
 -- A book by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. 
Just started it, but it looks promising.</p>

<p><a href="https://nlp.stanford.edu/IR-book/information-retrieval.html">Information Retrieval Resources</a>
 -- Resource link dump provided by the above authors.</p>

<p><a href="http://www.ardendertat.com/2011/05/30/how-to-implement-a-search-engine-part-1-create-index/">Arden Dertat's Blog</a>
 -- Conveniently, Arden Dertat has a series of blog posts about building search
engines. It looks good, but I don't know enough about the topic yet to vouch for it.</p>

<p><a href="https://en.wikipedia.org/wiki/Search_engine_indexing">Wikipedia: Search engine indexing</a>
 -- Exactly what it sounds like.</p>

<h2>Proposed Architecture</h2>

<p>NOTE: Most of what is described below is either frontend, or the very front of
the backend. Why? Because that's what I knew enough to write about when I
started. I'm currently studying the resources linked above, and will update as
I learn more.</p>

<p>We will have three main components:</p>

<ul>
<li>The web crawler</li>
<li>The data servers</li>
<li>The client</li>
</ul>

<h3>The Web Crawler</h3>

<p>I would argue that the crawler should take a list of domains to use as 'Tier 1'
and then crawl links as described in Drew's post.</p>
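
<p>A minimal sketch of how the tiering could work, assuming tiers are assigned
by link distance from the 'Tier 1' domains and that crawling stops after a fixed
depth. Both are assumptions made for illustration, not details taken from Drew's
post, and <code>assign_tiers</code>, <code>domain_of</code> and the example
domains are hypothetical. The link graph is passed in directly so the example
runs without touching the network:</p>

<pre>
```python
from collections import deque
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host part of a URL."""
    return urlparse(url).netloc.lower()


def assign_tiers(tier1_domains, links, max_tier=3):
    """Map each reachable domain to a tier using breadth-first search.

    links maps a page URL to the list of URLs it links to. A real crawler
    would discover this by fetching pages; here it is supplied up front.
    """
    # Collapse page-level links into a domain-level graph.
    graph = {}
    for page, targets in links.items():
        graph.setdefault(domain_of(page), set()).update(
            domain_of(t) for t in targets
        )

    tiers = {d.lower(): 1 for d in tier1_domains}
    queue = deque(tiers)
    while queue:
        domain = queue.popleft()
        if tiers[domain] == max_tier:
            continue  # assumption: links out of the last tier are not followed
        for target in sorted(graph.get(domain, ())):
            if target not in tiers:
                tiers[target] = tiers[domain] + 1
                queue.append(target)
    return tiers


links = {
    "https://docs.example.org/intro": ["https://blog.example.net/post"],
    "https://blog.example.net/post": ["https://forum.example.com/t/1"],
    "https://forum.example.com/t/1": ["https://spam.example.info/"],
}
tiers = assign_tiers(["docs.example.org"], links)
print(tiers)
```
</pre>

<p>Here the domain four hops away never gets a tier, which is the point of
starting from a hand-picked list: how far the crawl wanders is bounded by
<code>max_tier</code>.</p>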

<p>This crawler would then output a database which can be fed into the data 
servers, in both centralized and decentralized forms (for example, after running
a crawl, the crawler could offer to also submit the database to the centralized
data server).</p>

<p>This should help build up a central database, while still making it simple to run
your own instance with a given focus.</p>

<p>In order to combat abuse, I would have the central server run some form of sanity
check against each submitted database, as well as blacklisting sites that appear
to be blogspam or otherwise not useful.</p>
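
<p>One possible shape for such a check, as a sketch only. It assumes a submitted
database is a list of records with <code>url</code>, <code>title</code> and
<code>text</code> fields. That schema, the size limit and the name
<code>check_submission</code> are all hypothetical, and spotting blogspam
automatically is a much harder problem than this shows:</p>

<pre>
```python
from urllib.parse import urlparse

REQUIRED_FIELDS = ("url", "title", "text")  # hypothetical record schema


def check_submission(records, blacklist, max_records=100_000):
    """Split a submitted crawl into accepted records and rejection reasons."""
    if len(records) > max_records:
        return [], [("*", "submission too large")]

    accepted, rejected = [], []
    for record in records:
        url = record.get("url", "")
        parsed = urlparse(url)
        if any(not record.get(field) for field in REQUIRED_FIELDS):
            rejected.append((url, "missing field"))
        elif parsed.scheme not in ("http", "https") or not parsed.netloc:
            rejected.append((url, "not a web URL"))
        elif parsed.netloc.lower() in blacklist:
            rejected.append((url, "blacklisted domain"))
        else:
            accepted.append(record)
    return accepted, rejected


submission = [
    {"url": "https://docs.example.org/a", "title": "A", "text": "hello"},
    {"url": "https://spam.example.info/x", "title": "X", "text": "buy now"},
    {"url": "ftp://files.example.org/f", "title": "F", "text": "a file"},
    {"url": "https://docs.example.org/b", "title": "", "text": "untitled"},
]
accepted, rejected = check_submission(submission, {"spam.example.info"})
print(len(accepted), rejected)
```
</pre>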

<h3>The data/search servers</h3>

<p>These servers will take the databases output by the crawler, and use them
to respond to search requests.</p>

<p>However, unlike a traditional search engine, these servers will only respond
with data in some serialized format.</p>
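
<p>To make that concrete, here is a small sketch that assumes JSON as the
serialized format and a plain word-to-URL inverted index as the database. The
post does not pick either of these, and <code>build_index</code>,
<code>search</code> and the response fields are made up for illustration. There
is no ranking here, only matching on every query word:</p>

<pre>
```python
import json
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return a JSON string listing the URLs that contain every query word."""
    words = query.lower().split()
    if words:
        hits = set.intersection(*(index.get(w, set()) for w in words))
    else:
        hits = set()
    # The server sends data only; how to display it is left to the client.
    return json.dumps({"query": query, "results": sorted(hits)})


pages = {
    "https://docs.example.org/crawl": "writing a web crawler",
    "https://docs.example.org/index": "building a search index",
    "https://blog.example.net/post": "a web search engine",
}
response = search(build_index(pages), "web search")
print(response)
```
</pre>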

<p>The purposes of this are multifold.</p>

<p>First, it reduces the amount of development needed for the server, which seems
like a Good Thing, to me at least.</p>

<p>Second, it would allow for more <em>useful</em> data to be sent with the same amount of
bandwidth.</p>

<p>Third, it allows multiple search clients to be used, each with as many or as few
features as they please. One feature in particular that I would find useful would
be the ability to create a local search database that could be compared against
before sending out network requests.</p>
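
<p>The local-database idea can be sketched as follows, under the simplest
reading I can think of: the local database is a store of earlier results keyed
by the normalized query, and the network is only used on a miss. The class
<code>LocalFirstClient</code> and the stand-in <code>fake_remote</code> are
hypothetical, and a real local database would more likely be a searchable index
than a lookup table:</p>

<pre>
```python
class LocalFirstClient:
    """Answer repeated queries locally and only go remote on a miss."""

    def __init__(self, remote_search):
        self.remote_search = remote_search  # callable: query -> list of URLs
        self.local = {}                     # normalized query -> stored results

    def search(self, query):
        # Normalize case and whitespace so equivalent queries share an entry.
        key = " ".join(query.lower().split())
        if key not in self.local:
            self.local[key] = self.remote_search(key)
        return self.local[key]


remote_calls = []


def fake_remote(query):
    """Stand-in for a network request to a search server."""
    remote_calls.append(query)
    return ["https://docs.example.org/" + query.replace(" ", "-")]


client = LocalFirstClient(fake_remote)
first = client.search("Web  Crawler")
second = client.search("web crawler")
print(first == second, len(remote_calls))
```
</pre>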

<h3>The client</h3>

<p>The most basic client could be as simple as a web page which renders whatever data
it receives from the central server.</p>

<p>Some other features that could be included client-side:</p>

<ul>
<li><p>Building of a local search database.</p></li>
<li><p>Blacklisting of domains that you personally find non-useful.</p></li>
<li><p>A launcher for the web crawler, which could be used to improve results 
that you find lacking.</p></li>
<li><p>A switcher for which search server you want to connect to.</p></li>
</ul>
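
<p>Two of these, the personal blacklist and the server switcher, are small
enough to sketch. Everything named here (<code>ClientSettings</code>, its
methods, the server names and URLs) is hypothetical, and the sketch assumes
results arrive as a plain list of URLs:</p>

<pre>
```python
from urllib.parse import urlparse


class ClientSettings:
    """Client-side preferences: known servers, active server, blocked domains."""

    def __init__(self, servers, active):
        self.servers = dict(servers)  # server name -> base URL
        self.blocked = set()
        self.switch(active)

    def switch(self, name):
        """Select which search server later requests should go to."""
        if name not in self.servers:
            raise KeyError("unknown search server: " + name)
        self.active = name

    def block(self, domain):
        """Hide all results from a domain the user finds non-useful."""
        self.blocked.add(domain.lower())

    def visible(self, urls):
        """Drop results whose domain is on the personal blacklist."""
        return [u for u in urls if urlparse(u).netloc.lower() not in self.blocked]


settings = ClientSettings(
    {"central": "https://search.example.org", "mine": "http://localhost:8080"},
    active="central",
)
settings.block("spam.example.info")
settings.switch("mine")
results = settings.visible(
    ["https://docs.example.org/a", "https://spam.example.info/b"]
)
print(settings.servers[settings.active], results)
```
</pre>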

<p>And probably a few more, but the above would be my wishlist.</p>

<hr />

<h2>Pros/Cons of this architecture.</h2>

<p>Pros:</p>

<ul>
<li><p>In my mind at least, this strikes a good balance between de/centralization.</p></li>
<li><p>The search server is much simpler than it would have to be 
for a traditional engine.</p></li>
<li><p>It should be simpler to write your own clients/servers, as you could 
have standardized formats for search results and search databases.</p></li>
</ul>

<p>Cons:</p>

<ul>
<li><p>Is dependent on the user having access to a good client. There could of course
be one provided, but unscrupulous people could sabotage the network by
providing/advertising subpar clients.</p></li>
<li><p>Possibly leans too far towards decentralization. (Would users running crawls
submit them to the central server?)</p></li>
<li><p>Related to the first two, is dependent on a given provider giving access to
a good search server.</p></li>
<li><p>Is designed by someone who doesn't know much about either crawling or 
indexing, and therefore may be totally unviable for reasons I don't understand.</p></li>
</ul>
<div id="footer" class="footer">
        <hr style="color:#504945">
	<code>
	<p>A Shared 404: Evan Hoose's Website, Blog and stomping ground.</p>
	<p>The contents of this site are under the CC Attribution license, and 
	the code of the site generator is under the GPL V3.</p>
	</code>
</div>
</body>