Wayback: Self-hosted archiving service integrated with Internet Archive

pdimitar · on April 16, 2023

Only 3 matches of the word "local" in the README and they don't seem to refer to any self-hosting whatsoever.

What does this tool do exactly? Send URLs to remote services for them to do snapshots?

That README is a classic case of filter bubble syndrome. People, it can't physically hurt you to add 1-2 sentences saying what exactly your tool does!

pmontra · on April 16, 2023

It also stores locally. One of the last bullet points is

> Supports storing archived files to disk for offline use

pdimitar · on April 16, 2023

Ah, I see it now, thanks. IMO it should be put front and center and be at the start of the project's description.

x0x0 · on April 16, 2023

The link kind of buried in the upper right explains

https://docs.wabarc.eu.org/

polynox · on April 16, 2023

I would not complain about a free open source tool because they don’t do a good job of marketing it.

swyx · on April 16, 2023

but it would be nice if they spent a bit more time documenting themselves, same as you would want from any coworker at a big company. this isn't just marketing, it's good developer hygiene.

pdimitar · on April 16, 2023

They can do whatever they want, obviously.

But since it's open source they're likely aiming for more developer mindshare. To have that they should make their message crystal clear.

angelmm · on April 16, 2023

I love all these kinds of projects as I tend to be paranoid of losing good online content.

It’s also unclear to me how Wayback works. It seems more like an API than a self-hosted service.

I’m currently using ArchiveBox [0], which provides a complete API + UI.

- [0] https://archivebox.io/

rzzzt · on April 16, 2023

Are you using all extractors when saving a page?

I tried ArchiveBox and Shiori, but neither stuck for some reason. The latter is a bit more lightweight, it can save the entire page as well as a Readability-based conversion: https://github.com/go-shiori/shiori/

Linux-Fan · on April 16, 2023

I am not angelmm, but another happy ArchiveBox user.

My choice of extractors is the following: Singlefile, PDF, Screenshot, archive.org.

I found the largest issue with any website archiving tools to be the discrepancy between what I see in my Web Browser and what is saved. The most "reliable" way that still works today for me seems to be the "Save Page WE" Firefox plugin.

I have a sidecar container running that checks for HTML files appearing in a directory, triggers the archivebox save and then overwrites the "singlefile" capture by the provided HTML file. This way, I can trigger archiving by just using the Save Page WE plugin and storing the resulting HTML file in the directory.
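
A minimal sketch of what such a directory watcher could look like, not the commenter's actual setup. It assumes the saved HTML file records its source URL in a `<meta name="savepage-url" ...>` tag (an assumption about the Save Page WE output that should be checked against a real file), and the function names `extract_source_url` and `process_inbox` are hypothetical. The actual archiving step is passed in as a callback so the scanning logic stays independent of ArchiveBox.

```python
import html
import re
from pathlib import Path
from typing import Callable, List, Optional, Set

# Assumption: the saved page stores its origin in a tag like
# <meta name="savepage-url" content="https://...">. Verify against a real file.
URL_META = re.compile(
    r'<meta\s+name="savepage-url"\s+content="([^"]+)"', re.IGNORECASE
)


def extract_source_url(page: str) -> Optional[str]:
    """Return the original page URL recorded in the saved HTML, if any."""
    match = URL_META.search(page)
    return html.unescape(match.group(1)) if match else None


def process_inbox(
    inbox: Path, seen: Set[str], archive: Callable[[str, Path], None]
) -> List[str]:
    """Hand every not-yet-seen HTML file in `inbox` to the `archive` callback.

    Returns the URLs that were passed on during this scan.
    """
    handled = []
    for path in sorted(inbox.glob("*.html")):
        if path.name in seen:
            continue
        seen.add(path.name)
        url = extract_source_url(
            path.read_text(encoding="utf-8", errors="replace")
        )
        if url is None:
            continue  # no recognisable source URL, skip this file
        archive(url, path)
        handled.append(url)
    return handled
```

In a real sidecar, the callback would presumably run something like `archivebox add <url>` and then copy the saved file over the snapshot's SingleFile output; where that output lives depends on the ArchiveBox data directory layout, so it is left out here. Calling `process_inbox` in a loop with a short sleep gives simple polling behaviour.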

zerkten · on April 16, 2023

Can archive box aggregate content "bookmarked" in different places?

I want a tool that will pull saved items on Reddit, favorite posts on HN, etc. in addition to bookmarks posted to pinboard.in to a single place. In many ways, the share functionality in iOS allows me to get all URLs into a single place, but this doesn't help on desktop.

I know this isn't necessarily an easy task, if APIs aren't available. I'd be OK with a client component or browser extension, if it was open source and self-hostable.

Linux-Fan · on April 16, 2023

AFAIK it does not contain any function to support this directly.

My primary way to interact with Archive Box for such purposes is by calling it on the command line. Scripts may be used to obtain the URLs of interest from any source.

When I started with Archive Box I had some existing downloaded Websites from ScrapBook and Save Page WE already. I used some hacky scripts to extract the URLs from the respective pages and overwrite Archive Box's downloaded copies with my original copies, so that it would also work for pages that had been deleted in the meantime. All my data sources were local/desktop though.
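
As a rough illustration of the "scripts that obtain URLs from any source" idea, here is a small sketch that pulls URLs out of arbitrary text (exported bookmarks, saved HTML, notes) and de-duplicates them while keeping first-seen order. The function name `collect_urls` and the regular expression are illustrative assumptions, not part of ArchiveBox.

```python
import re
from typing import Iterable, List

# Deliberately simple pattern: stop at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def collect_urls(sources: Iterable[str]) -> List[str]:
    """Extract URLs from several text blobs, dropping duplicates in order."""
    seen = set()
    ordered = []
    for text in sources:
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;")  # trim trailing sentence punctuation
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered
```

Printing the result one URL per line and piping it into `archivebox add` should then be enough, since ArchiveBox accepts URLs on standard input.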

nikisweeting · on April 19, 2023

Yeah, you can set up scheduled imports from browser history, bookmarks services, RSS feeds, etc.

There are instructions for most of the common sources in the Input Sources section of the ArchiveBox readme :)

Handprint4469 · on April 16, 2023

ArchiveBox also saves a Readability version:

> Article Text: article.html/json Article text extraction using Readability & Mercury [0]

[0]: https://archivebox.io/#output-formats

mattrighetti · on April 16, 2023

When I initially saw the project I thought it was some kind of a WebArchive running locally, but this is not what it does, right?

> Wayback one or more url to Internet Archive and archive.today

> Wayback url to Internet Archive or archive.today or IPFS

Correct me if I’m wrong but… does this forward links to those services and does not actually run the archiving process locally? In that case I would make the argument that it’s not really what self-hosted conveys.

bomewish · on April 16, 2023

Gotta agree that it's really darn confusing as to what this thing actually does. Seems like a pretty big project that some talented people have poured a lot of time into, but what am I getting into here?

Does it (1) download archived urls already on archive.org locally? or (2) spider websites I choose from scratch and save them locally (or on archive.org??)

Totally unclear. These are very different things. Want to install and try it but also a little wary given the opacity of explanation about the basic functionality and purpose of the tool!

navigate8310 · on April 16, 2023

Why is this better than simply requesting a URL with cURL, such as archive.today/example.com/post1?
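
For comparison, the plain "just request a URL" approach can be sketched in a few lines. This only builds the request URLs for the Internet Archive's Save Page Now endpoint and its availability API; the helper names are made up for illustration, and no request is actually sent. Note that this route only asks a remote service to take the snapshot and keeps nothing locally.

```python
from urllib.parse import quote, urlencode


def save_page_now_url(target: str) -> str:
    """URL that asks the Wayback Machine to capture `target`."""
    # Keep the characters that are structurally meaningful in a URL.
    return "https://web.archive.org/save/" + quote(target, safe=":/?&=%#")


def availability_url(target: str) -> str:
    """URL that asks whether an archived copy of `target` already exists."""
    return "https://archive.org/wayback/available?" + urlencode({"url": target})
```

Fetching either URL with cURL or `urllib.request.urlopen` is then a one-liner; rate limits and error handling are left out of this sketch.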

pastage · on April 16, 2023

> Chromium: Wayback uses a headless Chromium to capture web pages for archiving purposes. [0]

I think it is a local spider. This quote is the only clear indication that they actually spider themselves. There are lots of implicit statements as well.

[0] https://docs.wabarc.eu.org/installation/

wabarc · on April 17, 2023

Correct, wayback has implemented remote and local archiving of various types of resources, including HTML, PDF, HAR, WARC, Readability (Markdown and Txt), and media files such as images, videos, and audio.

GTP · on April 16, 2023

I think that if you use the IPFS option, you can pin the generated files so you have them locally.

9dev · on April 16, 2023

> Wayback is a tool that supports running as a command-line tool and docker container, purpose to snapshot webpage to time capsules.
>
> Supported Golang version: See .github/workflows/testing.yml

The summary is hilarious! I still have not the slightest idea what it does, why I should care, or what it's good for. What are "time capsules", and what does "snapshot" mean in this context?

joshspankit · on April 16, 2023

It seems to make the assumption that people already know how the Internet Archive’s “wayback” service works, or at least what its purpose is.

The essence is that web pages change over time and can be taken down or otherwise lost. A “snapshot” is a capture of a webpage at a specific time, and I assume that “time capsule” is some type of format that holds the snapshot as well as the extra metadata. The result is something that can be used later to see that website as it was.

squarefoot · on April 16, 2023

Not the best explanation for sure. It seems to be a tool that can be used to offload and potentially decentralize some archiving work from the Internet Archive, with some privacy/anonymity added to help against censorship. About time I'd say, however I'm not sure if that can be used for bare files as well: books, retrogaming, software, etc.

kwhitefoot · on April 16, 2023

If I want to keep a copy of a web page I use the SingleFile Firefox add-on. It saves a copy of the rendered page and can embed images as data URLs.

wabarc · on April 17, 2023

The author here, thank you all for your interest in Wayback as a tool. It is a hobby project, and due to limited effort and time, we have focused on improving specific aspects of the tool and making it more user-friendly, but we have not invested in marketing.

The readme structure of the project is still in its original version and lags behind the development process. However, we have recently added basic documentation at https://docs.wabarc.eu.org, which still needs improvement.

During the initial stages of the project, the idea was rather unstructured, and we did not have a well-planned approach. It offers similar functionality to other tools of its kind; one of its most notable features is its integration with instant messaging (IM) tools such as Discord [0], Telegram [1], and others [2].

At this time, Wayback is still in an early stage of development, having recently implemented remote and local archiving of various resources, including HTML, PDF, HAR, WARC, Readability (Markdown and Txt), and media files such as images, videos, and audio. One of our next major goals is to improve the tool's indexing capabilities.

Overall, we hope that our work on Wayback will be useful to others, even in its current state of development. We appreciate any feedback or suggestions you may have for making it better.

[0] https://discord.com/api/oauth2/authorize?client_id=863324809...

[1] https://t.me/wabarc_bot

[2] https://docs.wabarc.eu.org/service/#service

baq · on April 16, 2023

Perhaps this should be integrated into browsers? Bonus points for automatic push to the wayback machine in a privacy preserving manner…

klysm · on April 16, 2023

Seems hard to determine if a page has personal content. The archiving should probably only be done on clients that don’t have credentials.

unintendedcons · on April 16, 2023

For archiving, look into https://github.com/dosyago/DiskerNet

It's real next gen thinking on this topic.

As for the featured tool wayback... If HN readers can't figure out what it does after reading docs, it's likely the thinking behind it is equally unclear.

gala8y · on April 16, 2023

Looking at the link you gave does not help much in seeing what DiskerNet does and looks like, either.

Keeping it simple, I download pages in Markdown adding some metadata (some tags). When I want images or more I use singlefile extension. Add Recoll to the mix and that's all I need.

https://github.com/deathau/markdownload

https://github.com/gildas-lormeau/SingleFile

https://www.lesbonscomptes.com/recoll/pages/index-recoll.htm...

kuschkufan · on April 16, 2023

What about videos (embedded or not)?

gala8y · on April 16, 2023

yt-dlp is good enough for most cases for me.

unintendedcons · on April 19, 2023

Replying because I can't edit:

I loved this project when it was called 22120 and was here https://github.com/c9fe/22120

It was AGPL3 and it was great! http://web.archive.org/web/20210123003545/https://github.com... The author links to an interview with an open source focus publication about it!

I should not have allowed the credit of the previous clearly explained open source project to carry forward after renaming, because it is now a different project.

I am sorry. I apologize. I will not recommend this software again, and I will try not to make this mistake in the future.

Thank you for helping me see this error.

mellosouls · on April 16, 2023

Does it actually record web archives (ie warcs)? I couldn't work out from a quick look at the repo whether it does that or not, though it claims to make your archives shareable.

There's a long-existing web archiving ecosystem with established formats for recording and publishing archives.

The repo linked could probably do with some clarity itself in how it does or doesn't fit in with standards.

uniqueuid · on April 16, 2023

diskernet is certainly interesting because it records archives as you browse.

But it's not open source and pretty limited regarding the use case.

The thing is, archiving is a multi-faceted and hard problem (i.e. video content, live streams, interactive sites), and will remain so. A complicated task leads to complicated tools.

jsiepkes · on April 16, 2023

DiskerNet looks cool but is apparently commercial software? Even for personal use you need a license I think?

JamesAdir · on April 16, 2023

Any recommendation for a tool that can crawl and download an entire website completely and save it locally?

thunderbong · on April 16, 2023

HTTrack Website Copier

https://www.httrack.com/

JamesAdir · on April 16, 2023

thanks, tried it, but it has a problem with Unicode URLs unless there is some setting I'm missing.

gala8y · on April 16, 2023

try adding +*{unicode} to scan rules. seriously, please try and let me know, if it worked.

JamesAdir · on April 18, 2023

I'm using the Windows version and still trying to figure out the options. Will update if I find something.

arboles · on April 16, 2023

Browsertrix or Brozzler, which crawl using headless browsers for accuracy.

anthk · on April 16, 2023

http://theoldnet.com

For old or modern browsers without JS, such as DilloNG or Netsurf.