THIS IS AN ADD-ON to the [BOORU CHARS 2021](https://nyaa.si/view/1384820) and [BOORU CHARS 2015](https://nyaa.si/view/1468367) torrents
It covers ~98% of the newly added images from the composite rips:
[01.2022 - 05.2022](https://nyaa.si/view/1539363) volume V2022B "double size"
[11.2021 - 01.2022](https://nyaa.si/view/1486179) volume V2022A
[08.2021 - 11.2021](https://nyaa.si/view/1462329) volume V2021D
[06.2021 - 08.2021](https://nyaa.si/view/1452049) volume V2021C
[03.2021 - 06.2021](https://nyaa.si/view/1409571) volume V2021B
**The image processing workflow has not changed substantially; the features are still the same:**
1) files are uniquely identified by the (booru + fid) key: imageboard name and file ID
verbose file naming: **%booru% - %fid% - %up-to-3-copyrights% ~ %up-to-5-characters% (%up-to-2-artists%)**
2) aspect ratio clustering with the freeware Dimensions2Folders
priorities, high to low: 7x10 +/-4% >> 3x4 +/-10% >> 1x1 +/-20% >> 3x2 +/-40% >> 2x3 +/-40%
3) file format unified, as in the composite rips
4) downsampling to 1280px on the longest side (1024x1024^ for the 1x1 +/-20% aspect ratio); files with 98-100% JPEG quality re-MOGRIFY'ed to 94%
5) imageboard tags arranged and partially embedded in the image EXIF info
6) some general image statistics obtained with [IMAGE MAGICK](https://imagemagick.org)
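
The priority rule from step 2 can be sketched in Python as below. This is an illustration only, not the Dimensions2Folders logic: it assumes the aspect ratio is width / height and that each tolerance is relative to the target ratio. The names `CLUSTERS` and `aspect_cluster` are made up for this example.

```python
from typing import Optional

# (label, width part, height part, relative tolerance), highest priority first.
# Values are taken from the priority list above; the tolerance semantics are assumed.
CLUSTERS = [
    ("7x10", 7, 10, 0.04),
    ("3x4", 3, 4, 0.10),
    ("1x1", 1, 1, 0.20),
    ("3x2", 3, 2, 0.40),
    ("2x3", 2, 3, 0.40),
]


def aspect_cluster(width: int, height: int) -> Optional[str]:
    """Return the first cluster (by priority) whose target ratio matches, else None."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    for label, w, h, tol in CLUSTERS:
        target = w / h
        # Assumption: the tolerance is a fraction of the target ratio.
        if abs(ratio - target) <= tol * target:
            return label
    return None
```

Because the narrow 7x10 band is tested first, an image of 720x1000 lands in 7x10 even though it also fits the wider 3x4 band.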
**DEEP CONTENT ANALYSIS produces bounding boxes**
7) [KERAS-CRAFT](https://github.com/notAI-tech/keras-craft) text detector is used to estimate the total size and number of text pieces
8) [YOLOv5 detector](https://github.com/aperveyev/booru_yolo): the number of detected heads is used for folder/archive distribution; detected torso components are assembled
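
As a rough sketch of how the text statistics from step 7 could be derived from detector output, the Python below counts boxes and sums their area. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` pixel tuples, which is a simplification and not the actual keras-craft output format; `text_stats` is a hypothetical helper.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed axis-aligned


def text_stats(boxes: Iterable[Box], image_width: int, image_height: int) -> Tuple[int, float]:
    """Return (number of text pieces, total text area as a fraction of the image)."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    count = 0
    area = 0.0
    for x1, y1, x2, y2 in boxes:
        count += 1
        # Degenerate or inverted boxes contribute zero area.
        area += max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Overlapping boxes are counted twice in this simple sum.
    return count, area / (image_width * image_height)
```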
A simple numerical rank among all images has been built for each numerical criterion,
so both outlier processing and ranking deal only with relative ranks 1..maxN or simple functions of them.
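
A minimal Python sketch of such a per-criterion rank follows. The function name `ranks` and the tie handling (ties keep their input order) are assumptions made for this example; the sample code shipped with the release may do this differently.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (smallest) to len(values) (largest)."""
    # sorted() is stable, so equal values are ranked in input order (assumed tie rule).
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result
```

For example, `ranks([0.5, 0.1, 0.9])` gives `[2, 1, 3]`. Working with ranks instead of raw values makes criteria with very different scales (entropy, text area, head count) directly comparable.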
**Identical to BC2015:**
- the "attractiveness score function" boils down to the definition "textless and colorful"
- the ~2% of outliers to delete were defined as follows (each criterion ranked independently):
* poorly presented: partially filled (min bounding box) OR least detailed (min entropy) OR too bright / dark (max absolute skewness)
* richest in text (maximum text piece count and/or total text area)
* most "crowded" (lots of tiny and mostly unjoined heads detected)
* JPEG artifacted (tiny or inflated after mogrify)
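
The idea of ranking each criterion independently and deleting the extreme tail of any of them can be sketched in Python as below. The criterion names, the `tail` cutoff and the helper `flag_outliers` are hypothetical; the real per-criterion cutoffs are not stated here.

```python
from typing import Dict, List, Set


def flag_outliers(metrics: Dict[str, List[float]],
                  extremes: Dict[str, str],
                  tail: int) -> Set[int]:
    """Flag image indices that fall in the worst `tail` positions of any criterion.

    metrics  : criterion name -> one value per image
    extremes : criterion name -> "min" or "max", the side treated as an outlier
    tail     : how many images to flag per criterion (an assumed cutoff)
    """
    flagged: Set[int] = set()
    for name, values in metrics.items():
        worst_first = sorted(range(len(values)),
                             key=lambda i: values[i],
                             reverse=(extremes[name] == "max"))
        # Criteria are combined with OR: one bad rank is enough.
        flagged.update(worst_first[:tail])
    return flagged
```

With `metrics = {"entropy": [5.0, 1.0, 4.0, 6.0], "text_pieces": [0, 2, 9, 1]}`, `extremes = {"entropy": "min", "text_pieces": "max"}` and `tail=1`, images 1 and 2 are flagged.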
**This release contains:**
- **705,467 sampled images**
* clustered by aspect ratio and by the number of heads detected (letter inside the folder name: 0 = A, 1 = D and E, 2 = B, 3+ = C)
* ordered by the "attractiveness score function" and grouped into zips/folders of 1000
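
A small Python sketch of the folder distribution is given below, under stated assumptions: single-head images are returned as "D/E" because the rule that separates D from E is not described here, and "groups of 1000, best score first" is one reading of the grouping. Both function names are made up for this example.

```python
from typing import List, Sequence, Tuple


def head_letter(heads: int) -> str:
    """Map the number of detected heads to the letter used in the folder name."""
    if heads < 0:
        raise ValueError("heads must be non-negative")
    if heads == 0:
        return "A"
    if heads == 1:
        return "D/E"  # D and E both hold single-head images; the split rule is not stated
    if heads == 2:
        return "B"
    return "C"  # three or more heads


def chunk_by_score(scored: Sequence[Tuple[str, float]],
                   size: int = 1000) -> List[List[Tuple[str, float]]]:
    """Sort (file name, score) pairs by score, best first, and cut into groups."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Assumption: a higher score means a more attractive image.
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```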
- rich image-related metadata (BC_2022.tsv, tab separated text)
- full tags list with Danbooru enrichment (BC_2022_tags.tsv)
- detailed results for the keras & yolo detection algorithms
- sample code (commandline, python, PL/SQL) for key algorithms - not "ready to use" but building blocks