GrabLinks

Synopsis

grablinks.py is a simple and streamlined Python 3 script to extract and filter links from a remote HTML resource.

Requirements

An installation of Python 3 (any version above 3.5 should do fine). Additionally the 3rd-party Python modules requests and beautifulsoup4 are required. Both modules can be easily installed with Python's package manager pip, e.g.:

pip --install requests --user
pip --install beautifulsoup4 --user

Usage

usage: grablinks.py [-h] [-V] [--insecure] [-f FORMATSTR] [--fix-links]
                    [-c CLASS] [-s SEARCH] [-x REGEX]
                    URL

Extracts, and optionally filters, all links (`<a href=""/>') from a remote
HTML document.

positional arguments:
  URL                   a fully qualified URL to the source HTML document

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version number and exit
  --insecure            disable verification of SSL/TLS certificates (e.g. to
                        allow self-signed certificates)
  -f FORMATSTR, --format FORMATSTR
                        a format string to wrap in the output: %url% is
                        replaced by found URL entries; %text% is replaced with
                        the text content of the link; other supported
                        placeholders for generated values: %id%, %guid%, and
                        %hash%
  --fix-links           try to convert relative and fragmental URLs to
                        absolute URLs (after filtering)

filter options:
  -c CLASS, --class CLASS
                        only extract URLs from href attributes of <a>nchor
                        elements with the specified class attribute content.
                        Multiple classes, separated by space, are evaluated
                        with an logical OR, so any <a>nchor that has at least
                        one of the classes will match.
  -s SEARCH, --search SEARCH
                        only output entries from the extracted result set, if
                        the search string occurs in the URL
  -x REGEX, --regex REGEX
                        only output entries from the extracted result set, if
                        the URL matches the regular expression

Report bugs, request features, or provide suggestions via
https://github.com/the-real-tokai/grablinks/issues

Usage Examples

# extract wikipedia links from 'www.example.com':
$ grablinks.py 'https://www.example.com/' --search 'wikipedia'
https://ja.wikipedia.org/wiki/仲間由紀恵
https://ja.wikipedia.org/wiki/黒木華
https://ja.wikipedia.org/wiki/清野菜名
…

# extract download links from 'www.example.com', create a shell script
# on-the-fly and pass it along to sh to fetch things with wget:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' --format 'wget "%url%"' | sh
# Note: Do not do that at home. It is dangerous! 😱

# extract/ handle links like
# <a href="https://example.com/crypricnumber">properfilename.dat</a>
$ grablinks.py 'https://www.example.com/' --format 'wget '\''%url%'\'' -O '\''%text%'\' > fetchfiles.sh
$ sh fetchfiles.sh
# Note: %text% is not sanitized by grablinks.py for safe shell usage. It is
#       recommended to verify this before executing things automatically

History

1.6	2-Dec-2023	Added '--insecure' argument to disable SSL/TLS certificate verification Added support for '%text%' placeholder in format string (<a>text</a>)
1.5	24-Nov-2022	Added a (fixed) timeout to the remote request.
1.4	30-May-2022	Improved handling of passing multiple classes to '--class'.
1.3	6-Feb-2021	Fix: handling of common edge cases when '--fix-links' is used.
1.2	16-Aug-2020	Fix: in some cases links from "<a>" tags without a 'class' attribute were not part of the result.
1.1	7-Jun-2020	Initial public source code release

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.github/workflows		.github/workflows
.gitignore		.gitignore
.pylintrc		.pylintrc
LICENSE		LICENSE
README.md		README.md
grablinks.py		grablinks.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

.gitignore

.gitignore

.pylintrc

.pylintrc

LICENSE

LICENSE

README.md

README.md

grablinks.py

grablinks.py

requirements.txt

requirements.txt

Repository files navigation

GrabLinks

Synopsis

Requirements

Usage

Usage Examples

History

About

Releases

Packages

Languages

License

the-real-tokai/grablinks

Folders and files

Latest commit

History

Repository files navigation

GrabLinks

Synopsis

Requirements

Usage

Usage Examples

History

About

Topics

Resources

License

Stars

Watchers

Forks

Languages