Is there a reason, other than the BeautifulSoup library, that Python is considered by many to be the ideal language for web scraping? I would think that JavaScript would be a far better choice since it could natively parse scripts on the page and libraries for querying and parsing the DOM have existed for a long time in JavaScript and are well known (to the point of being boring -- eg: jQuery).
You don't really get any benefit from writing it in javascript, other than the normal benefits you get from writing anything in javascript. (I say this having very little experience with server-side javascript, so take it with a grain of salt)
DOM emulation and selectors are pretty much equivalent between Node.js and Python; you can use CSS or XPath selectors on HTML/XML content in either of them. Either way you need to emulate something like a DOM, as neither language/execution environment has a "native" DOM.
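To illustrate the "emulated DOM" point, here's a minimal Python sketch using only the standard library's ElementTree (a real scraper would more likely use lxml or BeautifulSoup, which also tolerate broken markup; ElementTree needs well-formed input):

```python
from xml.etree import ElementTree

# Neither Python nor Node has a built-in browser DOM; a parser library
# builds a tree for you and you query it with selectors or XPath.
doc = "<html><body><div class='item'>A</div><div class='item'>B</div></body></html>"
root = ElementTree.fromstring(doc)

# ElementTree supports a limited subset of XPath:
items = [div.text for div in root.findall(".//div[@class='item']")]
print(items)  # ['A', 'B']
```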
You don't want to execute random JavaScript code from the web inside your scraper, and just being able to parse the scripts doesn't do you much good. So you're not getting the main advantage I think you're suggesting: being able to actually run the page's JavaScript.
Generally if you want to interact with JavaScript you need to do it in another process (I guess a sufficiently advanced sandbox could work too, an interpreter in your interpreter, but so far that doesn't exist). If you're already going to be running that JavaScript in a different process for security reasons, that process might as well be a "remote-controlled" web browser.
Historically that was done using Selenium, which has good Python bindings.
Nowadays it's being done more with Playwright, which started out as a Node.js library but is moving towards Python too.
Ultimately I think the reason is that there's no real advantage to using JavaScript, and Python is a nicer language with a healthier ecosystem, but your mileage may vary.
Actually, one big advantage I see is the ability to quickly come up with the needed functions and code in the browser DevTools, then use the exact same code in a Node script.
Personally I use this method with Puppeteer for advanced pages such as Single Page Apps (SPAs) and other pages that depend on JavaScript, CSS, or other features in the page. Another example of an advanced page would be a site where you have to physically scroll and wait for content to load from a web service. In these cases a headless browser with JavaScript makes the most sense to me.
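The scroll-and-wait pattern can be sketched language-agnostically; here's a hypothetical Python helper written against a Playwright-style `Page` object (the `page` argument and the loop bounds are illustrative assumptions, not any library's built-in):

```python
# Sketch: handling an infinite-scroll page. Assumes `page` behaves like a
# Playwright Page, exposing evaluate() and wait_for_timeout().

def scroll_until_stable(page, pause_ms=500, max_rounds=20):
    """Keep scrolling to the bottom until the page height stops growing."""
    last_height = 0
    for _ in range(max_rounds):
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # no new content loaded since the last scroll
        last_height = height
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give the web service time to respond
    return last_height
```

The same idea works nearly verbatim in Puppeteer, which is part of the "develop in DevTools, paste into the script" appeal.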
I've found that where it gets tricky with JavaScript is that a single missing `async/await` can introduce bugs that take extra time to track down.
For simple pages I do like Python, and the fact that you don't need `async/await`.
Selenium and Playwright both allow you to inject JavaScript directly into the page, which can be nice.
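In Playwright that injection goes through `Page.evaluate`; here's a hedged sketch of what it looks like from Python (`extract_links` is a made-up helper name, and `page` is assumed to be a Playwright page obtained elsewhere):

```python
# Sketch: pushing JavaScript into a live page from Python.

def extract_links(page):
    # The JS runs in the page's own context; its return value is
    # serialized back into a plain Python list of strings.
    return page.evaluate(
        "() => Array.from(document.querySelectorAll('a'), a => a.href)"
    )
```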
I see your point though. Also when I do playwright scripting I normally use async/await, so I guess the grass is always greener ;p
In Python I find a missing async/await is apparent very early on and doesn't really take extra time to solve. Maybe it's just better tracebacks in Python?
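A small example of why the mistake surfaces quickly in Python: a forgotten `await` hands you a coroutine object instead of the value, which tends to blow up (or at least warn) at the very next use:

```python
import asyncio

async def fetch():
    # Stand-in for a real network call.
    return "data"

async def main():
    broken = fetch()            # forgot await: this is a coroutine object, not "data"
    assert not isinstance(broken, str)
    broken.close()              # avoid the "coroutine was never awaited" RuntimeWarning
    return await fetch()        # with await we get the actual value

print(asyncio.run(main()))  # data
```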
If performance, especially concurrency, matters, then we should include Go-based libraries in the discussion as well.
Colly [1] is an all-batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.
rod [2] is a browser automation library based on the DevTools Protocol. It adheres to the Go-ish way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser, even if it's headless.
BeautifulSoup is great if you don't care about performance at all, because it is painfully slow.
lxml doesn't work well with broken HTML, but it is one or two orders of magnitude faster for parsing, and the same goes for querying with XPath.
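For comparison, the lxml equivalent of a typical BeautifulSoup query is just as terse; the speed comes from the XPath evaluation happening in libxml2's C code (a toy sketch, assuming lxml is installed):

```python
from lxml import html  # third-party: pip install lxml

tree = html.fromstring("<ul><li>a</li><li>b</li></ul>")

# The XPath query is executed by libxml2's C implementation:
print(tree.xpath("//li/text()"))  # ['a', 'b']
```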
Apart from that, there is also Scrapy, which is used a lot, but it is also very slow; it is just easy to scale horizontally.
There are a lot of times when scraping doesn't use HTML parsing at all. When you are scraping pages whose structure changes a lot, it can be better to go with full-text search, and in that case, the faster the better. In that area Python is far from the best, except when .split() and .join() are enough. Even re.match is slow, because the algorithm it uses is slow.
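The ".split() is enough" style looks like this in practice: anchor on text around the value and ignore the markup entirely (a toy sketch; the markers and markup here are made up):

```python
# String-method extraction: no HTML parsing, just text anchors around the value.
# Survives many structural changes as long as the anchors themselves stay put.
raw = '<span class="price">$19.99</span> <span class="stock">3 left</span>'

price = raw.split('class="price">', 1)[1].split("</span>", 1)[0]
print(price)  # $19.99
```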
And to finish, Requests is also super slow; if you want something fast you have to use pycurl.
Does Scrapy's slow speed actually matter much? Your main bottleneck is always going to be network calls and rate limiting. I don't know how much optimization can help there.
libxml is pretty slow (lxml uses it). Selectolax is 5 times faster for simple CSS queries. It is basically a thin wrapper around a well-optimized HTML parser written in C.
I was unaware, I always use bs4 with lxml for parsing xml just because I like the interface. For what I'm doing, the bottleneck is the remote system/network, so it doesn't really matter. But now I'm curious about which parts are slower and why. Maybe I'll run some experiments later.
Executing JavaScript and being able to render an HTML page are completely different things. To render an HTML page you need a way to create a DOM, download all resources, ... Node gives you no advantage here, as you have to use another library for that.
True - that's why you run scrapers using Playwright or Selenium - both of which can easily be scripted from either JavaScript or Python, while executing website code in a sandboxed browser instance.
That's when you break out PySelenium (if you want to stick with Python). Many languages work with Selenium drivers, so I don't think there's much point in debating which language is best for scraping. Probably one that supports threads; it depends on the scale, of course, and how much performance you want.