This release is an open dataset modeled on the [Danbooru 2018 set](https://nyaa.si/view/1176129).
It covers 1,227,622 thumbnails (512x512 px images) from several imageboards, combined with supporting metadata.
**NOTE: THIS IS AN OBSOLETE VERSION OF THE DATASET. The modern version consists of the [2021](https://nyaa.si/view/1384820), [2015](https://nyaa.si/view/1468367) and [2022](https://nyaa.si/view/1547662) volumes:**
- much larger (2.7M+ images) and better (sample size 1280/1024 px without black boxes)
- more tag metadata, better file naming, most valuable tags placed in EXIF
- more computed metadata (incl. bounding boxes)
- suitable for mobile browsing ...
**NEVERTHELESS, THIS RELEASE IS STILL SUPPORTED. Its main features are:**
- good technical and visual quality of the original images
* width>=900 height>=900 MPixels>=1.2
* most comics, primitives and text-heavy images manually excluded
* no photos, almost no characterless scenes
- several sources, but each image is uniquely identified by **%website% + %id%**
* most of the original images can be found in torrents (nyaa, rutracker)
* selective re-download of originals is possible if the source website is available
- careful deduplication using relative website priorities, listed (mostly) from high to low:
* safebooru.org
* yande.re
* e-shuushuu.net
* konachan.com
* gelbooru.com
* chan.sankakucomplex.com
* zerochan.net
* anime-pictures.net
* danbooru.donmai.us
* tbib.org
- image file names are mostly structured and contain **%website% - %id% - %copyright% ~ %characters% (%artist%)**
- not completely SFW (a little bit softcore ecchi here and there)
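
The naming, deduplication and size rules above can be illustrated with a minimal Python sketch. It is not the script used to build the dataset. It assumes that `%website%` appears in file names as the domain listed above, that `%id%` is numeric, and that "MPixels" means width x height / 1,000,000. The names `parse_name`, `pick_preferred` and `passes_size_filter` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Priority order copied from the list above (index 0 = highest priority).
SITE_PRIORITY = [
    "safebooru.org", "yande.re", "e-shuushuu.net", "konachan.com",
    "gelbooru.com", "chan.sankakucomplex.com", "zerochan.net",
    "anime-pictures.net", "danbooru.donmai.us", "tbib.org",
]


class ImageName(NamedTuple):
    website: str
    post_id: str
    copyright: str
    characters: str
    artist: str


# Assumed layout: "%website% - %id% - %copyright% ~ %characters% (%artist%)",
# optionally followed by a file extension.
NAME_RE = re.compile(
    r"^(?P<website>\S+) - (?P<id>\d+) - (?P<copyright>.*?) ~ "
    r"(?P<characters>.*) \((?P<artist>[^()]*)\)(?:\.\w+)?$"
)


def parse_name(filename: str) -> Optional[ImageName]:
    """Split a structured file name into its fields, or return None.

    Names are only "mostly" structured, so callers must handle None.
    """
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return ImageName(m["website"], m["id"], m["copyright"],
                     m["characters"], m["artist"])


def site_rank(website: str) -> int:
    """Lower rank = higher priority; unknown sites sort last."""
    try:
        return SITE_PRIORITY.index(website)
    except ValueError:
        return len(SITE_PRIORITY)


def pick_preferred(duplicates):
    """Among copies of the same picture, keep the highest-priority site."""
    return min(duplicates, key=lambda n: site_rank(n.website))


def passes_size_filter(width: int, height: int) -> bool:
    """width>=900, height>=900, MPixels>=1.2 (megapixel = 10**6 pixels, assumed)."""
    return width >= 900 and height >= 900 and width * height >= 1_200_000
```

For example, `parse_name("yande.re - 123456 - some series ~ some character (some artist).jpg")` yields the pair `("yande.re", "123456")` that identifies the image, while a file name that does not follow the pattern returns `None`.
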
The image timeline covers 10.2016 - 08.2019 densely and earlier periods selectively, split into "volumes":
**V2019** - 11.2018-08.2019 taken from rip https://nyaa.si/view/1202653
**V2018** - period 2017-2018 from rips https://nyaa.si/view/1181364
https://www.acgnx.se/show-cceb3260269b5423cbd7f8d59f2c84531750923b.html
https://nyaa.si/view/771715 and https://nyaa.si/view/513582
and (Russian) https://rutracker.org/forum/viewtopic.php?t=5478026
**V2016** - up to 10.2016 from https://nyaa.si/view/891391
partially used https://nyaa.si/view/750972 and https://nyaa.si/view/875411
**V2016W** - up to 05.2016, converted to wallpaper sizes
https://nyaa.si/view/710893, https://nyaa.si/view/745633
and https://rutracker.org/forum/viewtopic.php?t=5198985
**V2018D** - remainder of https://nyaa.si/view/1176129 that survived cleanup and deduplication, mostly 2015 and earlier
files renamed according to metadata, white backgrounds for addon-2018 replaced with black ones
#### Metadata:
- copyright, character and artist tag lists based on Danbooru tags
* copyrights bundled into Franchises
* characters refer to Franchises
* copyrights and characters refer to MyAnimeList entities
- statistical image properties, read from the JPG header or calculated
* entropy (complexity), skewness (darkness)
* colors count and intensity by channels
* color saturation (grayness), edge intensity
* bounding-box coordinates and more
- face detection results (Nagadomi) with 3 levels of accuracy combined
- complete copyright / character / artist metadata for 407,424 Safebooru posts
* safebooru string tags with Danbooru tag-ids
* Franchises wherever applicable
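
The exact formulas behind the statistical properties are not given here, so the following is only a sketch of common definitions: Shannon entropy of the intensity histogram, skewness as the standardized third central moment, and mean HSV saturation as a measure of "grayness". The function names are hypothetical and the stored values may be computed differently.

```python
import colorsys
import math
from collections import Counter


def shannon_entropy(pixels):
    """Entropy in bits of the intensity histogram (higher = more complex)."""
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def skewness(pixels):
    """Third central moment divided by std**3.

    Positive values mean most pixels are dark with a tail of bright ones.
    """
    n = len(pixels)
    mean = sum(pixels) / n
    m2 = sum((p - mean) ** 2 for p in pixels) / n
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    if m2 == 0:
        return 0.0  # constant image: skewness is undefined, report 0
    return m3 / m2 ** 1.5


def mean_saturation(rgb_pixels):
    """Average HSV saturation in [0, 1]; 0 means a pure grayscale image."""
    sats = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1]
            for r, g, b in rgb_pixels]
    return sum(sats) / len(sats)
```

The functions take plain sequences of pixel values, so they do not depend on any imaging library. With Pillow installed, grayscale intensities could be obtained with `list(Image.open(path).convert("L").tobytes())`.
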
#### Software:
- Windows BAT scripts for processing with ImageMagick
- Python scripts for some grabbing and processing
**This dataset may be used for massive localized image processing and [meta-]data mining,** e.g.
- scene scale and composition classification, training / evaluation of species recognition algorithms
- visual quality and attractiveness ranking / prediction
- any imaginable metadata query, with visualized results at your fingertips