~evan-hoose/a-shared-404

f526ae9a2ff0319b8e989ea7d9a1151ab4f552b6 — Evan 10 months ago f231c9d
On branch master
Changes to be committed:
	modified:   index.html
	new file:   top/blog/a-less-stupid-search-solution/index.html
	new file:   top/blog/a-less-stupid-search-solution/less-stupid-search.md
 Changes not staged for commit:
	deleted:    footer.html
M index.html => index.html +1 -3
@@ 72,9 72,7 @@ code {
        <a href="/other-stuff" class="inactive">Other Stuff</a>
        <hr style="color:#504945">
</div>
<h1>Literally just my personal site.</h1>

<p>Or it would be, if I could get a domain name.</p>
<h1>Literally just my personal site..</h1>

<p>Can be built using <a href="git.sr.ht/~evan-hoose/SSSSS">SSSSS</a>.</p>
<div id="footer" class="footer">

A top/blog/a-less-stupid-search-solution/index.html => top/blog/a-less-stupid-search-solution/index.html +186 -0
@@ 0,0 1,186 @@
<!DOCTYPE html>
<head>
<title>AS4 | Evan Hoose</title>
<style>
body {
	font-family: "Lucida Console", Monaco, monospace;
	padding: 0px 10% 0px;
	/*background-color:  #001214;*/
	background-color: #282828;
}
p,ol,ul {
	/*color: #93a1a1;*/
	color: #ebdbb2;
}

.tab-bar-hr {
	background-color: #504945;
	color: #504945;
}

.active {
	/*background-color: #484848;*/
	background-color: #504945;
	padding: 9px 9px 9px;
}

.inactive {
	padding: 9px 9px 9px;
}
.inactive:hover {
	background-color: #504945;
}
h1 {
	/*color: #839496;*/
	color: #d79921;
}
h2,h3,h4,h5,h6 {
	/*color: #5f5faf;*/
	color: #d3869b
}
a {
	/*color: #2aa198;*/
	color: #83a598;
}
hr {
	/*color: #93a1a1;*/
	color: #ebdbb2;
}

code {
	padding: 5px 5px 5px;
	color: #a89984;
	background-color: #32302f;
}

.footer {
	color: #504945;
}
</style>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body>
<div id="header">
        <h1>A Shared 404 |</h1>
        <p>Evan Hoose's Website, Blog and stomping ground</p>
        <a href="/" class="inactive">Home</a>
        <a href="/hire-me" class="inactive">Hire Me</a>
        <a href="/blog" class="active">Blog</a>
	<a href="/programs" class="inactive">Programs</a>
        <a href="/tutorials" class="inactive">Tutorials</a>
        <a href="/other-stuff" class="inactive">Other Stuff</a>
        <hr class="tab-bar-hr">
</div>
<h1>A less stupid search solution.</h1>

<p>Drew DeVault wrote <a href="https://drewdevault.com/2020/11/17/Better-than-DuckDuckGo.html">this</a>.
(Read that first. It'll provide useful context I won't explain.)</p>

<p>I like Drew's work, and this got me thinking.</p>

<p>What is the best way to implement a search engine in this style?</p>

<hr />

<h2>Proposed Architecture</h2>

<p>We will have three main components:</p>

<ul>
<li>The web crawler</li>
<li>The data servers</li>
<li>The client</li>
</ul>

<h3>The Web Crawler</h3>

<p>I would argue that the crawler should take a list of domains to use as 'Tier 1'
and then crawl links as described in Drew's post.</p>

<p>This crawler would then output a database which can be fed into the data 
servers, in both centralized and decentralized forms (for example, after running
a crawl, the crawler could offer to also submit the database to the centralized
data server).</p>

<p>This should help build up a central database, while still making it simple to run
your own instance with a given focus.</p>

<p>In order to combat abuse, I would have the central server do some form of sanity
check against each submitted database, as well as blacklist sites that appear to be 
blogspam or otherwise not useful.</p>

<h3>The data/search servers</h3>

<p>These servers will take the databases output by the crawler, and use them
to respond to search requests.</p>

<p>However, unlike a traditional search engine, these servers will only respond
with data in some serialized format.</p>

<p>The reasons for this are manifold.</p>

<p>First, it reduces the amount of development needed for the server, which seems
like a Good Thing, to me at least.</p>

<p>Second, it would allow for more <em>useful</em> data to be sent with the same amount of
bandwidth.</p>

<p>Third, it allows multiple search clients to be used, each with as many or as few
features as they please. One feature in particular that I would find useful would
be the ability to create a local search database that could be compared against
before sending out network requests.</p>

<h3>The client</h3>

<p>The most basic client could be as simple as a web page which renders whatever data
it receives from the central server.</p>

<p>Some other features that could be included client-side:</p>

<ul>
<li><p>Building of a local search database.</p></li>
<li><p>Blacklisting of domains that you personally find non-useful.</p></li>
<li><p>A launcher for the web-crawler, which could be used to improve results 
that you find lacking.</p></li>
<li><p>A switcher for which search server you want to connect to.</p></li>
</ul>

<p>And probably a few more, but the above would be my wishlist.</p>

<hr />

<h2>Pros/Cons of this architecture.</h2>

<p>Pros:</p>

<ul>
<li><p>In my mind at least, this strikes a good balance between de/centralization.</p></li>
<li><p>The search server is much simpler than what it would have to be 
for a traditional engine.</p></li>
<li><p>It should be simpler to write your own clients/servers, as you could 
have standardized formats for search results and search databases.</p></li>
</ul>

<p>Cons:</p>

<ul>
<li><p>Is dependent on the user having access to a good client. There could of course
be one provided, but unscrupulous people could sabotage the network by
providing/advertising subpar providers.</p></li>
<li><p>Possibly leans too far towards decentralization. (Would users running crawls
submit them to the central server?)</p></li>
<li><p>Related to the first two, is dependent on a given provider giving access to
a good search server.</p></li>
<li><p>Is designed by someone who doesn't know much about either crawling or 
indexing, and therefore may be totally unviable for reasons I don't understand.</p></li>
</ul>
<div id="footer" class="footer">
        <hr style="color:#504945">
	<code>
	<p>A Shared 404: Evan Hoose's Website, Blog and stomping ground.</p>
	<p>The contents of this site are under the CC Attribution license, and 
	the code of the site generator is under the GPL V3.</p>
	</code>
</div>
</body>

A top/blog/a-less-stupid-search-solution/less-stupid-search.md => top/blog/a-less-stupid-search-solution/less-stupid-search.md +108 -0
@@ 0,0 1,108 @@
# A less stupid search solution.

Drew DeVault wrote [this](https://drewdevault.com/2020/11/17/Better-than-DuckDuckGo.html).
(Read that first. It'll provide useful context I won't explain.)

I like Drew's work, and this got me thinking.

What is the best way to implement a search engine in this style?

---

## Proposed Architecture

We will have three main components:

* The web crawler
* The data servers
* The client

### The Web Crawler

I would argue that the crawler should take a list of domains to use as 'Tier 1'
and then crawl links as described in Drew's post.

This crawler would then output a database which can be fed into the data 
servers, in both centralized and decentralized forms (for example, after running
a crawl, the crawler could offer to also submit the database to the centralized
data server).

This should help build up a central database, while still making it simple to run
your own instance with a given focus.

In order to combat abuse, I would have the central server do some form of sanity
check against each submitted database, as well as blacklist sites that appear to be 
blogspam or otherwise not useful.
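
To make this concrete, here is a rough sketch of what such a crawler could look like. It is purely illustrative: the seed list, the SQLite schema, and the choice of Python's standard library are my own assumptions, not a spec for any real implementation.

```python
# Hypothetical crawler sketch. Seeds a queue with 'Tier 1' domains,
# does a bounded breadth-first crawl, and writes pages into a SQLite
# database that could later be handed to (or submitted to) a data server.
import re
import sqlite3
import urllib.request
from collections import deque

TIER_1 = ["https://example.org/"]  # your trusted seed domains go here

def crawl(seeds, page_limit=100, db_path="crawl.db"):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    queue, seen = deque(seeds), set(seeds)
    while queue and page_limit > 0:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page; a real crawler would log and retry
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))
        page_limit -= 1
        # Follow absolute links. A real crawler would respect robots.txt
        # and track how many hops a page is from the Tier 1 set.
        for link in re.findall(r'href="(https?://[^"]+)"', body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    db.commit()
    db.close()

if __name__ == "__main__":
    crawl(TIER_1)
```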

### The data/search servers

These servers will take the databases output by the crawler, and use them
to respond to search requests.

However, unlike a traditional search engine, these servers will only respond
with data in some serialized format.

The reasons for this are manifold.

First, it reduces the amount of development needed for the server, which seems
like a Good Thing, to me at least.

Second, it would allow for more *useful* data to be sent with the same amount of
bandwidth.

Third, it allows multiple search clients to be used, each with as many or as few
features as they please. One feature in particular that I would find useful would
be the ability to create a local search database that could be compared against
before sending out network requests.
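
As a sketch of how small such a server could be, here is roughly what I have in mind. Again, everything here (the port, the JSON shape, the naive LIKE query against the crawler sketch's database) is an assumption for illustration, not a proposed standard.

```python
# Hypothetical data server sketch: answers GET /search?q=... with plain JSON
# instead of rendered HTML. Reads the crawl.db produced by the crawler sketch.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        db = sqlite3.connect("crawl.db")
        rows = db.execute(
            "SELECT url FROM pages WHERE body LIKE ? LIMIT 20",
            (f"%{query}%",),
        ).fetchall()
        db.close()
        # The serialized result format: just a query echo and a list of URLs.
        payload = json.dumps({"query": query, "results": [url for (url,) in rows]})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), SearchHandler).serve_forever()
```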

### The client

The most basic client could be as simple as a web page which renders whatever data
it receives from the central server.

Some other features that could be included client-side:

* Building of a local search database.
    
* Blacklisting of domains that you personally find non-useful.
    
* A launcher for the web-crawler, which could be used to improve results 
  that you find lacking.
    
* A switcher for which search server you want to connect to.

And probably a few more, but the above would be my wishlist.
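
To show how thin a client could be, here is a minimal command-line sketch. It assumes the toy server above, and the blacklist file is just my guess at how local, per-user filtering might work.

```python
# Hypothetical client sketch: queries a search server, then filters the
# serialized results against a personal blacklist before showing them.
import json
import sys
import urllib.parse
import urllib.request

SERVER = "http://localhost:8080"  # could be swapped for any compatible server

def load_blacklist(path="blacklist.txt"):
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def search(query):
    url = f"{SERVER}/search?q={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    blacklist = load_blacklist()
    return [u for u in data["results"]
            if urllib.parse.urlparse(u).hostname not in blacklist]

if __name__ == "__main__":
    for result in search(" ".join(sys.argv[1:])):
        print(result)
```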

---

## Pros/Cons of this architecture.

Pros:

* In my mind at least, this strikes a good balance between de/centralization.
    
* The search server is much simpler than what it would have to be 
  for a traditional engine.
    
* It should be simpler to write your own clients/servers, as you could 
  have standardized formats for search results and search databases.

Cons:

* Is dependent on the user having access to a good client. There could of course
  be one provided, but unscrupulous people could sabotage the network by
  providing/advertising subpar providers.

* Possibly leans too far towards decentralization. (Would users running crawls
  submit them to the central server?)

* Related to the first two, is dependent on a given provider giving access to
  a good search server.

* Is designed by someone who doesn't know much about either crawling or 
  indexing, and therefore may be totally unviable for reasons I don't understand.