
Is there a reason, other than the BeautifulSoup library, that Python is considered by many to be the ideal language for web scraping? I would think that JavaScript would be a far better choice since it could natively parse scripts on the page and libraries for querying and parsing the DOM have existed for a long time in JavaScript and are well known (to the point of being boring -- eg: jQuery).


You don't really get any benefit from writing it in javascript, other than the normal benefits you get from writing anything in javascript. (I say this having very little experience with server-side javascript, so take it with a grain of salt)

DOM emulation and selectors are pretty much equivalent between nodejs and python: you can use CSS or XPath selectors on html/xml content in either of them. Either way you need to emulate something like a DOM, as neither language/execution environment has a "native" DOM.
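To make the "emulate something like a DOM" point concrete, here's a minimal sketch using only Python's stdlib `html.parser` (the snippet of HTML is made up for illustration); real scrapers would usually reach for lxml or BeautifulSoup, which offer the CSS/XPath selectors mentioned above:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Minimal stand-in for a DOM: collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<p>See <a href="https://example.com">example</a> and <a href="/docs">docs</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['https://example.com', '/docs']
```

The equivalent in nodejs would pull in something like cheerio or jsdom, which is the sense in which neither side has a built-in DOM.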

You don't want to execute random javascript code from the web inside your scraper, and just being able to parse the scripts doesn't do you much good. So you're not getting the main advantage I think you're suggesting, which is being able to actually run the page's javascript.

Generally if you want to interact with javascript you need to do it in another process (I guess a sufficiently advanced sandbox could work too, an interpreter in your interpreter, but so far that doesn't exist). If you're already going to be running that javascript in a different process for security reasons that different process might as well just be a "remote controlled" web browser.

Historically that was done using selenium, which has good python bindings.

Nowadays it's being done more with playwright, which started out as a nodejs library but now has python bindings too.
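The "remote controlled web browser" approach looks roughly like this with Playwright's sync API (a sketch, assuming `pip install playwright` and `playwright install chromium` have been run; the function name and URL handling are illustrative):

```python
def scrape_with_browser(url):
    """Fetch a page via a remote-controlled browser so its scripts actually run."""
    # Imported lazily so the sketch can be read/loaded without playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # serialized DOM *after* page scripts have executed
        browser.close()
        return html
```

The untrusted javascript runs inside the browser's sandbox, not in your Python process, which is the security split described above.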

Ultimately I think the reason is that there's no real advantage to using javascript and python is a nicer language with a healthier ecosystem, but your mileage may vary.


Actually one big advantage I see is the ability to quickly come up with the needed functions and code in the browser DevTools, then use the exact same code in a node script.

Personally I use this method with Puppeteer for advanced pages such as Single Page Apps (SPAs) and other pages that depend on JavaScript, CSS, or other features in the page. Another example of an advanced page would be a site where you have to physically scroll and wait for content to load from a web service. In these cases a headless browser with JavaScript makes the most sense to me.

I've found that where it gets tricky with JavaScript is that a single missing `async/await` can introduce bugs in your code that take extra time to solve.

For simple pages I do like Python and that you don't need `async/await`.
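The missing-`await` pitfall exists in Python's asyncio too, for what it's worth. A small sketch (the `fetch` coroutine here is a hypothetical stand-in for a real network call):

```python
import asyncio

async def fetch(url):
    """Hypothetical stand-in for an async HTTP request."""
    await asyncio.sleep(0)
    return f"<html><title>{url}</title></html>"

async def main():
    page = await fetch("https://example.com")  # correct: a string
    oops = fetch("https://example.com")        # missing await: a coroutine object, not a string
    is_coro = asyncio.iscoroutine(oops)
    oops.close()  # close the never-awaited coroutine to avoid a RuntimeWarning
    return page, is_coro

page, is_coro = asyncio.run(main())
print(type(page).__name__, is_coro)  # str True
```

The difference is that CPython at least warns loudly about a coroutine that was never awaited, which may be why these bugs feel quicker to find in Python.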


Selenium and playwright both allow you to inject javascript directly into the page, which can be nice.

I see your point though. Also when I do playwright scripting I normally use async/await, so I guess the grass is always greener ;p

In python I find a missing async/await is apparent very early on and doesn't really take extra time to solve. Maybe python just has better tracebacks?


If performance, especially concurrency, matters, then we should include Go-based libraries in the discussion as well.

Colly [1] is an all-batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.

rod [2] is a browser automation library based on the DevTools Protocol which adheres to the Go-ish way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser even if it's headless.

[1] https://github.com/gocolly/colly

[2] https://github.com/go-rod/rod


This is a thoughtful response, I don’t understand why it’s being downvoted.


Me neither, alas


BeautifulSoup is great if you don't care about the performance at all. Because it is painfully slooooooww.

Lxml doesn't work well with broken html, but it is one or two orders of magnitude faster for parsing, and the same goes for querying with xpath.

Apart from that, there is also Scrapy, which is used a lot, but it is also very slow; it's just easy to scale horizontally.

There are a lot of cases in which scraping doesn't use html parsing: when you are scraping pages whose structure changes a lot, it might be better to go with full-text search, and in that case, the faster the better. In that area Python is far from the best, except when .split() and .join() are enough. Even re.match is slow because the algorithm it uses is slow.
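The .split()/regex approach can be sketched like this (the markup and the "price" class are made up for illustration; this is a sketch of the technique, not a benchmark):

```python
import re

html = '<div class="price">$19.99</div><div class="price">$5.00</div>'

# String methods: fast and structure-agnostic, but brittle if the markup changes.
# Split on the marker text, then cut each chunk at the next tag.
prices = [chunk.split("<")[0] for chunk in html.split('class="price">')[1:]]
print(prices)  # ['$19.99', '$5.00']

# Regex alternative: capture everything after the marker up to the next '<'
print(re.findall(r'class="price">([^<]+)', html))  # ['$19.99', '$5.00']
```

Neither version builds a DOM at all, which is why this can beat a parser on pages whose structure keeps shifting.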

And to finish, Requests is also super slow; if you want something fast you have to use pycurl.


In my experience selectolax is about 10x faster than lxml, and keeps the familiar CSS selector API: https://rushter.com/blog/python-fast-html-parser/


Does Scrapy's slow speed actually matter much? Your main bottleneck is always going to be network calls and rate limiting. I don't know how much optimization can help there.


Selectolax is nice, much faster than bs4 or lxml. Not a very well known project yet though.

Not sure there's anything faster on the javascript side of the fence?


If it beats lxml that is pretty impressive. Too bad it doesn't support xpath.


Libxml is pretty slow (lxml uses it). Selectolax is 5 times faster for simple CSS queries. It is basically a thin wrapper for a well optimized HTML parser written in C.


Beautiful Soup can use lxml, and does by default for parsing xml.


There is a big speed difference between lxml alone and lxml + bs4


I was unaware, I always use bs4 with lxml for parsing xml just because I like the interface. For what I'm doing, the bottleneck is the remote system/network, so it doesn't really matter. But now I'm curious about which parts are slower and why. Maybe I'll run some experiments later.


Yes, the difference is infinite with broken HTML, which is *checks notes* a huge chunk of the Internet.


Maybe because of its historical position in the scientific/data science ecosystem?

Django and Flask are also very popular libraries, so the language and culture gap isn’t as large as it may seem.


Python is indeed far from ideal for scraping in the modern web, but for only one reason: It can't execute javascript.

As a result, js generated content cannot be scraped, and python scrapers also get blocked very fast as they don't execute fingerprinting scripts.


Executing javascript and being able to render an HTML page are completely different things. To render an HTML page you need a way to create a DOM, download all resources, etc. And Node gives you no advantage there, as you have to use another library for that.


Don't execute random javascript from the web in nodejs; nodejs isn't sandboxed the way web browsers are, and it generally won't work anyway.

That opens up massive security problems.


True - that's why you run scrapers using Playwright or Selenium - both of which can easily be scripted from either JavaScript or Python, while executing website code in a sandboxed browser instance.


Is there a similar guide that walks through step by step how to perform scraping using one of those sandboxes?


Not a guide, but a doc I've used before is listed below. I use webdriver to open Firefox.

https://selenium-python.readthedocs.io/
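The "webdriver to open Firefox" setup from those docs looks roughly like this (a sketch assuming selenium 4.x and geckodriver are installed; the function name is made up):

```python
def fetch_title(url):
    """Open `url` in headless Firefox via Selenium and return the page title."""
    # Imported lazily so the sketch can be loaded without selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.title  # title of the rendered page, scripts included
    finally:
        driver.quit()  # always shut the browser down, even on errors
```

As with Playwright, the website's javascript runs inside the browser, not in your Python process.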


That's when you break out PySelenium (if you want to stick with python). Many languages work with selenium drivers, so I don't think there's much point in debating which language is best for scraping. Probably one that supports threads; it depends on the scale, of course, and how much performance you want.



