Only 3 matches of the word "local" in the README and they don't seem to refer to any self-hosting whatsoever.
What does this tool do exactly? Send URLs to remote services for them to do snapshots?
That README is a classic case of filter bubble syndrome. People, it can't physically hurt you to add 1-2 sentences saying exactly what your tool does!
But it would be nice if they spent a bit more time documenting themselves, same as you would want from any coworker at a big company. This isn't just marketing, it's good developer hygiene.
I tried ArchiveBox and Shiori, but neither stuck for some reason. The latter is a bit more lightweight, it can save the entire page as well as a Readability-based conversion: https://github.com/go-shiori/shiori/
I am not angelmm, but another happy ArchiveBox user.
My choice of extractors is the following: Singlefile, PDF, Screenshot, archive.org.
I found the largest issue with any website archiving tools to be the discrepancy between what I see in my Web Browser and what is saved. The most "reliable" way that still works today for me seems to be the "Save Page WE" Firefox plugin.
I have a sidecar container running that checks for HTML files appearing in a directory, triggers the archivebox save and then overwrites the "singlefile" capture by the provided HTML file. This way, I can trigger archiving by just using the Save Page WE plugin and storing the resulting HTML file in the directory.
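For anyone wanting to try something similar, here is a minimal sketch of such a sidecar loop body, not the actual script. It assumes the `archivebox add <url>` CLI call and that the SingleFile capture is stored as `singlefile.html` inside the snapshot folder. The helpers `url_for` and `snapshot_dir_for` are hypothetical callbacks you would have to supply yourself, since how the URL and the snapshot folder are found depends on your setup.

```python
import shutil
import subprocess
from pathlib import Path


def find_new_html_files(watch_dir, seen):
    """Return HTML files in watch_dir whose names are not yet in `seen`."""
    new_files = []
    for path in sorted(Path(watch_dir).glob("*.html")):
        if path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def process_once(watch_dir, seen, url_for, snapshot_dir_for, run=subprocess.run):
    """Archive each new HTML file, then replace the SingleFile capture with it.

    url_for(path) -> str            hypothetical: maps a saved file to its URL
    snapshot_dir_for(url) -> Path   hypothetical: locates the snapshot folder
    """
    handled = []
    for html_file in find_new_html_files(watch_dir, seen):
        url = url_for(html_file)
        # Let ArchiveBox create the snapshot and run its extractors.
        run(["archivebox", "add", url], check=True)
        # Assumption: the SingleFile output is named singlefile.html.
        target = Path(snapshot_dir_for(url)) / "singlefile.html"
        shutil.copyfile(html_file, target)
        handled.append(url)
    return handled
```

Calling `process_once` in a loop with a short sleep gives the polling behaviour. The `run` parameter is only there so the command can be swapped out for a dry run.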
Can archive box aggregate content "bookmarked" in different places?
I want a tool that will pull saved items on Reddit, favorite posts on HN, etc. in addition to bookmarks posted to pinboard.in to a single place. In many ways, the share functionality in iOS allows me to get all URLs into a single place, but this doesn't help on desktop.
I know this isn't necessarily an easy task if APIs aren't available. I'd be OK with a client component or browser extension, if it was open source and self-hostable.
AFAIK it does not contain any function to support this directly.
My primary way to interact with Archive Box for such purposes is by calling it on the command line. Scripts may be used to obtain the URLs of interest from any source.
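As a rough illustration of the "scripts" part, the sketch below merges URL lists from several sources into one de-duplicated list. The source lists are placeholders; fetching saved items from Reddit, HN or pinboard.in is not shown and would need separate code for each service.

```python
def merge_urls(*sources):
    """Merge several iterables of URLs, dropping blanks and duplicates.

    The first occurrence of each URL wins, so the original order is kept.
    """
    seen = set()
    merged = []
    for source in sources:
        for raw in source:
            url = raw.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

Each URL in the merged list can then be handed to the command line, for example with `archivebox add <url>`.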
When I started with Archive Box I already had some downloaded websites from ScrapBook and Save Page WE. I used some hacky scripts to extract the URLs from the respective pages and overwrite Archive Box's downloaded copies with my original copies, so that it would also work for pages that had been deleted in the meantime. All my data sources were local/desktop though.
When I initially saw the project I thought it was some kind of a WebArchive running locally, but this is not what it does right?
> Wayback one or more url to Internet Archive and archive.today
> Wayback url to Internet Archive or archive.today or IPFS
Correct me if I’m wrong but… does this forward links to those services and does not actually run the archiving process locally? In that case I would make the argument that it’s not really what self-hosted conveys.
Gotta agree that it's really darn confusing as to what this thing actually does. Seems like a pretty big project that some talented people have poured a lot of time into, but what am I getting into here?
Does it (1) download archived urls already on archive.org locally? or (2) spider websites I choose from scratch and save them locally (or on archive.org??)
Totally unclear. These are very different things. Want to install and try it but also a little wary given the opacity of explanation about the basic functionality and purpose of the tool!
> Chromium: Wayback uses a headless Chromium to capture web pages for archiving purposes. [0]
I think it is a local spider. This quote is the only clear indication that they actually spider themselves. There are lots of implicit statements as well.
Correct, wayback has implemented remote and local archiving of various types of resources, including HTML, PDF, HAR, WARC, Readability (Markdown and Txt), and media files such as images, videos, and audio.
> Wayback is a tool that supports running as a command-line tool and docker container, purpose to snapshot webpage to time capsules.
>
> Supported Golang version: See .github/workflows/testing.yml
The summary is hilarious! I still have not the slightest idea what it does, why I should care, or what it's good for. What are "time capsules", and what does "snapshot" mean in this context?
It seems to make the assumption that people already know how the Internet Archive’s “wayback” service works, or at least what its purpose is.
The essence is that web pages change over time and can be taken down or otherwise lost. A “snapshot” is a capture of a webpage at a specific time, and I assume that “time capsule” is some type of format that holds the snapshot as well as the extra metadata. The result is something that can be used later to see that website as it was.
Not the best explanation for sure. It seems to be a tool that can be used to offload and potentially decentralize some archiving work from the Internet Archive, with some privacy/anonymity added to help against censorship.
About time I'd say, however I'm not sure if that can be used for bare files as well: books, retrogaming, software, etc.
The author here, thank you all for your interest in Wayback as a tool. It is a hobby project, and due to limited effort and time, we have focused on improving specific aspects of the tool and making it more user-friendly, but we have not invested in marketing.
The readme structure of the project is still in its original version and lags behind the development process. However, we have recently added basic documentation at https://docs.wabarc.eu.org, which still needs improvement.
During the initial stages of the project, the idea was rather unstructured, and we did not have a well-planned approach. It offers similar functionality to other tools of its kind. One of its most notable features is its integration with instant messaging (IM) tools such as Discord [0], Telegram [1], and others [2].
At this time, Wayback is still in an early stage of development, having recently implemented remote and local archiving of various resources, including HTML, PDF, HAR, WARC, Readability (Markdown and Txt), and media files such as images, videos, and audio. One of our next major goals is to improve the tool's indexing capabilities.
Overall, we hope that our work on Wayback will be useful to others, even in its current state of development. We appreciate any feedback or suggestions you may have for making it better.
As for the featured tool wayback... If HN readers can't figure out what it does after reading the docs, it's likely the thinking behind it is equally unclear.
Looking at the link you gave does not help much in seeing what DiskerNet does and looks like, either.
Keeping it simple, I download pages in Markdown adding some metadata (some tags). When I want images or more I use singlefile extension. Add Recoll to the mix and that's all I need.
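A minimal sketch of what "Markdown plus some tags" could look like in practice. It assumes a YAML-style front matter header and simple one-word tags; the field names `source`, `saved` and `tags` are made up for illustration and are not taken from any particular tool.

```python
import json


def with_front_matter(markdown_body, url, saved, tags):
    """Prepend a small metadata header to a page saved as Markdown.

    `saved` is passed in as a string (e.g. an ISO date) to keep this simple.
    """
    header = [
        "---",
        # json.dumps adds double quotes so special characters in the URL are safe.
        f"source: {json.dumps(url)}",
        f"saved: {saved}",
        "tags: [" + ", ".join(tags) + "]",
        "---",
        "",
    ]
    return "\n".join(header) + "\n" + markdown_body
```

Since the result is plain text, the tags stay searchable with any full-text indexer.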
I should not have allowed the credit of the previous clearly explained open source project to carry forward after renaming, because it is now a different project.
I am sorry. I apologize. I will not recommend this software again, and I will try not to make this mistake in the future.
Does it actually record web archives (ie warcs)? I couldn't work out from a quick look at the repo whether it does that or not, though it claims to make your archives shareable.
There's a long-existing web archiving ecosystem with established formats for recording and publishing archives.
The repo linked could probably do with some clarity itself in how it does or doesn't fit in with standards.
diskernet is certainly interesting because it records archives as you browse.
But it's not open source and pretty limited regarding the use case.
The thing is, archiving is a multi-faceted and hard problem (i.e. video content, live streams, interactive sites), and will remain so. A complicated task leads to complicated tools.