Building JobScout: Crawling the Internet for Freelance Projects
When you're a freelancer, the most annoying part of the job often isn't the work --- it's finding the work.
Freelance projects are scattered across dozens of platforms, agency websites, job boards, and forums. Every platform requires manual searching, filtering, and constant checking.
I built JobScout to solve exactly this problem.
JobScout continuously crawls the internet for freelance projects and sends relevant opportunities directly to developers based on their skills.
In this article I want to focus on the engineering behind the system --- how it crawls thousands of sources, processes the data, matches it with freelancers, and sends notifications efficiently.
The Core Idea
The idea behind JobScout is simple:
Instead of freelancers searching for projects, projects should find freelancers.
The system does three main things:
- Crawl project platforms
- Normalize and enrich project data
- Match projects with freelancers
At the time of writing, the system has already collected over 130,000 project listings and is growing continuously.
System Overview
The architecture is intentionally simple and pragmatic.
```
+------------------+
|   Web Crawlers   |
+--------+---------+
         |
         v
+------------------+
|  Data Pipeline   |
| (Normalization)  |
+--------+---------+
         |
         v
+------------------+
|    PostgreSQL    |
|  Project Store   |
+--------+---------+
         |
         v
+------------------+
| Matching Engine  |
+--------+---------+
         |
         v
+------------------+
|  Email Notifier  |
+------------------+
```
The system consists of four main components:
- Crawlers
- Data processing pipeline
- Matching engine
- Notification system
Each part has its own challenges.
Crawling the Internet for Projects
The hardest problem is simply collecting the data.
Freelance projects appear in many different formats:
- traditional job boards
- agency websites
- freelance marketplaces
- company career pages
- community forums
Each source has its own HTML structure, pagination logic, and sometimes anti-scraping mechanisms.
Instead of building a generic crawler, JobScout uses platform-specific crawlers.
Each crawler implements the same interface:
```python
class PlatformCrawler:
    def crawl(self) -> List[ScrapeResult]:
        ...
```
The result is a normalized object:
```python
class ScrapeResult:
    id: str
    platform: str
    title: str
    description: str
    company: str
    location: str
    rate: str
    link: str
    created_at: datetime
    all_skills: List[str]
    inferred_skills: List[str]
```
Every crawler is responsible for:
- fetching pages
- parsing HTML
- extracting project information
- returning normalized results
This approach has a huge advantage:
If one platform changes its layout, only a single crawler needs to be updated.
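To make this concrete, here is a minimal sketch of what one platform-specific crawler implementing the shared interface could look like. Everything here is illustrative, not JobScout's actual code: the platform name, the HTML structure, and the trimmed-down `ScrapeResult` are assumptions, and the `fetch` method is stubbed with static HTML where production code would make an HTTP request.

```python
from dataclasses import dataclass, field
from datetime import datetime
from html.parser import HTMLParser
from typing import List


@dataclass
class ScrapeResult:
    # Trimmed to a few fields for the sketch.
    id: str
    platform: str
    title: str
    link: str
    created_at: datetime
    all_skills: List[str] = field(default_factory=list)


class _ListingParser(HTMLParser):
    """Collects (title, href) pairs from <a class="listing"> tags."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "listing":
            self._href = attrs.get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.listings.append((data.strip(), self._href))
            self._href = None


class ExampleBoardCrawler:
    platform = "exampleboard"

    def fetch(self) -> str:
        # Production code would issue an HTTP request here;
        # static HTML keeps the sketch self-contained.
        return '<a class="listing" href="/jobs/1">Python Backend Dev</a>'

    def crawl(self) -> List[ScrapeResult]:
        parser = _ListingParser()
        parser.feed(self.fetch())
        return [
            ScrapeResult(
                id=f"{self.platform}:{href}",
                platform=self.platform,
                title=title,
                link=href,
                created_at=datetime.utcnow(),
            )
            for title, href in parser.listings
        ]
```

Because each crawler owns its own parsing logic behind the shared `crawl()` interface, a layout change on one platform stays contained in one class.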
Handling Duplicate Projects
Many agencies repost the same projects on multiple platforms.
Without deduplication the system would quickly fill with duplicates.
To handle this, JobScout computes a content fingerprint based on:
- title
- description
- company
Example:
```python
fingerprint = sha256(
    (title + description + company).encode()
).hexdigest()
```
If the hash already exists in the database, the project is skipped.
This simple technique removes most duplicates without complex NLP.
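A minimal, runnable sketch of the idea looks like this. In production the lookup would be a unique-index check in PostgreSQL; here an in-memory set stands in for the database, and the dict field names are illustrative.

```python
import hashlib


def fingerprint(title: str, description: str, company: str) -> str:
    """Content fingerprint used for cross-platform deduplication."""
    raw = (title + description + company).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


# Stand-in for a unique index in the database.
seen = set()


def is_new(project: dict) -> bool:
    """Return True the first time a project's content is seen."""
    fp = fingerprint(
        project["title"], project["description"], project["company"]
    )
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

The same listing reposted on a second platform produces the same fingerprint (the platform-specific link is deliberately excluded), so the repost is skipped.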
Data Storage
All project data is stored in PostgreSQL.
The schema is intentionally simple:
```
projects
--------
id
platform
title
description
company
location
rate
link
created_at
all_skills
inferred_skills
```
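As a rough illustration, the table could be created as follows. This is a sketch, not the production schema: SQLite stands in for PostgreSQL, and the skill lists are stored as plain text here where Postgres arrays or a join table would be used in practice.

```python
import sqlite3

# In-memory SQLite as a stand-in for the real PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE projects (
        id              TEXT PRIMARY KEY,
        platform        TEXT NOT NULL,
        title           TEXT NOT NULL,
        description     TEXT,
        company         TEXT,
        location        TEXT,
        rate            TEXT,
        link            TEXT,
        created_at      TEXT,
        all_skills      TEXT,
        inferred_skills TEXT
    )
""")

conn.execute(
    "INSERT INTO projects (id, platform, title) VALUES (?, ?, ?)",
    ("exampleboard:1", "exampleboard", "Python Backend Dev"),
)
row = conn.execute("SELECT platform, title FROM projects").fetchone()
```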
Over time the dataset has grown to more than 130k projects, which opened the door to some interesting data science experiments like:
- project clustering
- skill demand analysis
- market trend detection
Matching Projects to Freelancers
Freelancers sign up on the landing page and enter their skills.
Example:
```
Python
AWS
Terraform
Kubernetes
```
Each project also contains extracted skills.
Matching is currently based on tag intersection.
Simplified version:
```python
def matches(project, freelancer):
    return len(
        set(project.skills) &
        set(freelancer.skills)
    ) > 0
```
If there is an overlap, the project is considered relevant.
This is intentionally simple but works surprisingly well.
Future versions will likely use:
- TF-IDF similarity
- embeddings
- semantic search
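As a first step beyond plain tag intersection, overlapping skills could be weighted by rarity, in the spirit of TF-IDF: matching on a rare skill like Terraform should count for more than matching on a ubiquitous one like Python. The following is a hedged sketch under that assumption; the function names, dict shapes, and weighting formula are illustrative, not JobScout's actual code.

```python
import math
from collections import Counter


def idf_weights(projects):
    """Inverse-document-frequency weight per skill across all projects."""
    df = Counter(skill for p in projects for skill in set(p["skills"]))
    n = len(projects)
    # Smoothed IDF: rare skills get larger weights.
    return {s: math.log((1 + n) / (1 + c)) + 1 for s, c in df.items()}


def score(project, freelancer_skills, weights):
    """Sum the IDF weights of the overlapping skills."""
    overlap = set(project["skills"]) & set(freelancer_skills)
    return sum(weights.get(s, 0.0) for s in overlap)
```

With this scoring, projects can be ranked per freelancer instead of treated as a flat "relevant / not relevant" set.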
Notification Engine
Freelancers can configure how often they want to receive updates:
- every 24 hours
- every 12 hours
- every 6 hours
- every hour
The notification system works like a batch scheduler.
```
+------------+
| Scheduler  |
+-----+------+
      |
      v
+------------+
| Match Jobs |
+-----+------+
      |
      v
+------------+
| Send Email |
+------------+
```
Instead of sending emails immediately, projects are aggregated and sent as digest emails.
This significantly reduces email volume and infrastructure cost.
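The batching step above can be sketched as follows. This is a simplified model, not JobScout's actual scheduler: time is represented as plain hour counts instead of real timestamps, and the field names are assumptions.

```python
from collections import defaultdict


def build_digests(matches, now, last_sent, interval_hours):
    """
    matches: list of (email, project_title) pairs from the matcher
    now: current time, in hours (simplified clock)
    last_sent: {email: hour of last digest}
    interval_hours: {email: configured frequency}
    Returns {email: [titles]} for users whose interval has elapsed.
    """
    pending = defaultdict(list)
    for email, title in matches:
        pending[email].append(title)
    return {
        email: titles
        for email, titles in pending.items()
        if now - last_sent.get(email, 0) >= interval_hours.get(email, 24)
    }
```

Users whose interval has not yet elapsed simply keep their matches queued until the next scheduler run, which is what turns a stream of individual matches into a handful of digest emails.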
Operating Costs
The infrastructure for JobScout is surprisingly cheap.
The entire system currently runs on a small server costing about:
~ $15 / month
Because most of the work is I/O-bound (crawling and matching), the system does not require expensive compute resources.
With some optimizations the system could easily scale to tens of thousands of users on a small cluster.
Lessons Learned
Building JobScout taught me a few important lessons.
1. Scraping is messy
Every platform behaves differently. Layout changes are constant and you must design your crawlers to be easy to update.
2. Simple systems scale further than expected
The matching algorithm is extremely simple, yet users still find relevant projects.
3. Data becomes valuable over time
After collecting thousands of projects, the dataset itself becomes interesting.
You can analyze:
- which technologies are trending
- how freelance rates change
- which regions have the most projects
What's Next
There are several improvements planned for JobScout:
- AI-based project matching
- recommendation system
- recruiter project posting
- advanced analytics for freelancers
The goal is to evolve JobScout from a crawler into a data platform for the freelance market.
Final Thoughts
JobScout started as a small side project to solve a personal problem.
Today it has:
- hundreds of freelancers
- over 130,000 projects indexed
- a continuously growing dataset
The system is intentionally simple, but that's exactly why it works.
Sometimes the best products are just small tools that remove daily friction.
If you're curious, you can check it out here: