
I used to do this to track job postings at target companies, and for a few other purposes, back in the heady era of 2008. One of the hassles is that some pages don't neatly contain their useful data in JSON, so the tiny deploy script becomes a little more complicated to support the various steps:

1. Collection: logins, curl

2. Pretty print: jq / tidy

3. Selectors: Beautiful Soup / jq

4. Annotation: svn / git commit

The GitHub Actions angle is new and welcome though.
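The four steps above can be sketched in a few lines of standard-library Python. This is a hedged illustration, not anyone's actual scraper: the JSON payload is a canned stand-in for a real fetch, and the file name and git commands are hypothetical.

```python
import json

# 1. Collection: normally urllib or curl against the target site;
#    a canned payload stands in here so the sketch runs offline.
raw = '{"jobs": [{"id": 1, "title": "SRE", "company": "Acme"}]}'

# 2. Pretty print (the jq/tidy step): stable, diff-friendly formatting,
#    so day-to-day changes show up as small, readable diffs.
pretty = json.dumps(json.loads(raw), indent=2, sort_keys=True)

# 3. Selectors (the jq/Beautiful Soup step): pull out the fields to track.
titles = [job["title"] for job in json.loads(raw)["jobs"]]

# 4. Annotation: write the snapshot and commit it, so version history
#    records every edit the company makes to the ad. The git calls are
#    left as comments since they need a repo to run in:
with open("jobs.json", "w") as f:
    f.write(pretty)
# subprocess.run(["git", "add", "jobs.json"])
# subprocess.run(["git", "commit", "-m", "daily snapshot"])
```

The sort_keys and indent arguments matter more than they look: without deterministic formatting, every commit is one giant diff and the history is useless.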



Oh, and I should mention: it's pretty cool to see how companies change job ads on a daily basis. Beyond just the spelling fixes, I've regularly seen places put up help-wanted postings for like a day and remove them the next. Or tack on an 'on call 24/7/365' statement. Or reuse a job posting but tack on the word 'Manager.'


That's so interesting. Next time I'm job seeking I'll definitely look into doing this.


If you're interested in Australia or New Zealand, we can chat. I have a scraper running on my spare laptop now, and am going to try porting it to GitHub Actions so it keeps running without me.

The source data is all HTML rather than JSON though: I have to scrape the index pages, parse out job IDs, and then re-scrape the individual job listings. Having it all in a SQLite database is more useful than the site's default search: e.g. all jobs that don't include the phrase "right to live and work in this location", all jobs that contain email addresses, GROUP BY advertiser - features I wish for but don't expect will ever be added to the source site.
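A minimal sketch of what those queries might look like, assuming a hypothetical `jobs` table; the schema, advertiser names, and sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    advertiser TEXT,
    description TEXT)""")
conn.executemany(
    "INSERT INTO jobs (advertiser, description) VALUES (?, ?)",
    [
        ("Acme", "Must have right to live and work in this location."),
        ("Acme", "Contact hiring@acme.example for details."),
        ("Blob Co", "Great role, apply within."),
    ],
)

# Jobs that don't demand existing work rights.
no_rights = conn.execute(
    "SELECT id FROM jobs WHERE description NOT LIKE "
    "'%right to live and work in this location%'"
).fetchall()

# Jobs whose ad includes an email address (crude substring match;
# a real query might use a stricter pattern).
with_email = conn.execute(
    "SELECT id FROM jobs WHERE description LIKE '%@%'"
).fetchall()

# Posting counts per advertiser - the GROUP BY the source site lacks.
per_advertiser = conn.execute(
    "SELECT advertiser, COUNT(*) FROM jobs GROUP BY advertiser"
).fetchall()
```

Once the scraped listings are rows in a table, each of these "wishlist" searches is a one-liner, which is the whole argument for SQLite over the site's search box.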


In retrospect it mostly didn't work. What I really learned from this is that help-wanted pages are a formality. I figure places that hire people who read HN pay recruiters, and applying directly to the company less often 'skips the line' and more often finds a direct line to the trash can, since H-1B hiring rules require employers to try to find local talent first.



