Despite my own promises, here is a hybrid followup for both [BOORU_CHARS datasets](https://nyaa.si/view/1740396) and [safebooru-centric composite rips](https://nyaa.si/view/1733499).
This time the main source was **danbooru** (safe+questionable, interval **ID 6640000..8200000 = 31.08.2023..24.09.2024**),
"the best of" furry-related **e621** and loli-enabled **gelbooru** for the same interval were used as add-ons.
As in the rips :
- images initially filtered by Mpixels>=0.48, shorter_side>=600 px, file size>=60000 bytes, no animations
stripes dropped or cropped to aspect ratio 0.4..2.1
- PNG/WEBP/AVIF converted to JPG using **cjpegli 96% quality** (2000000 bytes limit)
modest downsampling done to a typical longer side of 2560px (landscape), 1920px (1x1), 2480px (portrait)
- verbose file naming used **"%website% - %id% - %up_to_3_copyrights% ~ %up_to_5_characters% (%up_to_2_artists%).jpg"**
files uniquely identified by "%website%+%id%"
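
The initial filter and the "%website%+%id%" key above can be sketched in Python. This is only an illustration, not the tooling used for the release: it assumes Mpixels means width*height/1e6, that aspect ratio means width/height, and that file names follow the quoted template with literal " - " separators. The function names are made up for the example, and images outside 0.4..2.1 are simply rejected here, while the release also crops some of them.

```python
import re
from typing import Optional, Tuple

# Thresholds quoted in the release notes.
MIN_MPIXELS = 0.48
MIN_SHORT_SIDE = 600            # px
MIN_BYTES = 60_000
MIN_ASPECT, MAX_ASPECT = 0.4, 2.1   # assumed to be width / height


def passes_initial_filter(width: int, height: int, size_bytes: int,
                          animated: bool = False) -> bool:
    """Return True if an image satisfies all quoted thresholds."""
    if animated:
        return False
    if width * height / 1_000_000 < MIN_MPIXELS:
        return False
    if min(width, height) < MIN_SHORT_SIDE:
        return False
    if size_bytes < MIN_BYTES:
        return False
    return MIN_ASPECT <= width / height <= MAX_ASPECT


# Assumed layout: "<website> - <id> - <copyrights> ~ <characters> (<artists>).jpg"
_NAME_RE = re.compile(r"^(?P<website>\S+) - (?P<id>\d+) - ")


def file_key(fname: str) -> Optional[Tuple[str, int]]:
    """Extract the (website, id) pair that uniquely identifies a file."""
    m = _NAME_RE.match(fname)
    return (m["website"], int(m["id"])) if m else None
```

The key is enough to join a file on disk back to its metadata row, whatever the rest of the name contains.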
As in the datasets, extensive processing was done and used for content sorting :
- some general image statistics gathered with ExifTool and [ImageMagick](https://imagemagick.org)
- content analysis was mostly the same as in BC2023, with up-to-date software and models
- [CRAFT text detector](https://github.com/fcakyon/craft-text-detector) used to estimate total size and number of text pieces
- torso components detected with [custom PyTorch model](https://github.com/aperveyev/booru_yolo/tree/main/models)
built on [Ultralytics YOLOv11](https://github.com/ultralytics/ultralytics)
- imageboard tags arranged and partially placed inside image EXIF-info
- clustering implemented both
- by aspect ratio { 7x10 +/-4% ; 3x4 +/-10% ; 1x1 +/-20% ; 3x2 +/-40% ; 2x3 +/-40% }
- by detected head-count { 0 heads = letter A, 2 = B, 3-5 heads = C, 6+ heads = D, 1 head = letter E }
- sorting inside cluster based on "attractiveness score function" == "colorful and textless"
- balanced folder/zip typically contains ~1000-2600 files
- lowest-rated images tend to be manga-like and were manually reviewed
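
The two clustering rules above can be sketched as follows. The head-count letters are taken directly from the list. The aspect-ratio part is an assumption: tolerances are read as relative deviations from the nominal width/height ratio, and buckets are tested in the listed order with the first match winning. Read that way, the 2x3 and 3x2 buckets span exactly 0.4..2.1, which agrees with the initial aspect-ratio limits. The function names are hypothetical.

```python
from typing import Optional

# (bucket name, nominal width/height ratio, relative tolerance), in listed order.
ASPECT_BUCKETS = [
    ("7x10", 7 / 10, 0.04),
    ("3x4", 3 / 4, 0.10),
    ("1x1", 1.0, 0.20),
    ("3x2", 3 / 2, 0.40),
    ("2x3", 2 / 3, 0.40),
]


def aspect_bucket(width: int, height: int) -> Optional[str]:
    """Return the first bucket whose tolerance band contains width/height."""
    ratio = width / height
    for name, nominal, tol in ASPECT_BUCKETS:
        if abs(ratio - nominal) <= nominal * tol:
            return name
    return None  # outside every band


def head_letter(heads: int) -> str:
    """Map a detected head count to the cluster letter from the list above."""
    if heads < 0:
        raise ValueError("head count cannot be negative")
    if heads == 0:
        return "A"
    if heads == 1:
        return "E"
    if heads == 2:
        return "B"
    if heads <= 5:
        return "C"
    return "D"
```

With these rules a 750x1000 image with one detected head would land in bucket `3x4`, letter `E`.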
Content is a little less processed and a little more NSFW compared to predecessors.
Nevertheless :
- real-life photos, no-character landscapes, foods and macro thrown away
- most comics and N-koma, overtexted images and line-arts filtered out
- too "questionable" images (uncensored nipples or vulva, obvious adult actions) excluded >> BOORU BOOBS planned
- some background crops, gamma correction, rotation, denoise and other nontrivial improvements implemented
Images deduplicated using [AntiDupl](https://github.com/ermig1979/AntiDupl) up to 2% similarity along with BOORU CHARS 2023 and 2022.
Besides images, the release contains tab-separated text files :
- **BC_2024.tsv** file/image related metadata **1.260.629 rows**
- **BC_2024_tags.tsv** tags list with enrichment 49.041.220 rows
- **BC_2024_yolo.tsv** detailed results for torso components detection 4.431.887 rows
- **BC_2024_yolov11m_aa22.pt** PyTorch YOLOv11 model to get a picture below
and also dedicated "readme" with structures description.
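
With row counts in the millions, the TSV files are easier to stream than to load whole. A minimal sketch, assuming UTF-8 text and no quoting conventions (neither is stated here; the column layout and any header row are described in the bundled readme, so the code does not assume either):

```python
import csv
from typing import Iterator, List


def iter_tsv_rows(path: str, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield rows of a tab-separated file one at a time.

    QUOTE_NONE keeps quote characters inside tags as literal text.
    """
    with open(path, newline="", encoding=encoding) as fh:
        yield from csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)


def count_rows(path: str) -> int:
    """Count rows without holding the file in memory."""
    return sum(1 for _ in iter_tsv_rows(path))
```

Running `count_rows` on each file and comparing with the figures above is a quick integrity check after download.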
Keep in mind this release is first of all
**a dataset of character-centric art in effective local format suited for batch processing**
and then
**a representative catalog of anime/game/cartoon copyrights, characters and artists for visual estimation**
but **not a complete and maximum quality rip.**
Some tips on use cases :
```
@REM -- loop explore torrent zips
for %%F in ("d:\torr\BOORU_CHARS_2024\2024-3x4\*.zip") do 7z x -r -o"C:\TEMP\" "%%F" *sousou*frieren*stark*
@REM -- or
for /R d:\torr\BOORU_CHARS_2024 %%J in (*.zip) do 7z x -r -o"C:\TEMP\" "%%J" *sousou*frieren*stark*
@REM -- much more effective if unzipped
xcopy /s "A:\BOORU_CHARS_2024\*sousou*frieren*stark*" C:\TEMP\
-- and gets more sophisticated with a database (copy-paste the result into just_do_it.BAT)
select 'xcopy "'||bc.fpath||'\'||bc.fname||'" C:\TEMP\' xcpy
from bc
join bc_dt d on d.booru=bc.booru and d.fid=bc.fid
where bc.fname like '%dungeon%meshi%senshi%' and d.tag='pantyshot' -- brutal dwarf fanservice
```
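
For those not on Windows, the zip-exploring loop can be approximated in Python. This is a sketch using only the standard library, not an exact equivalent of `7z x -r`: it matches the wildcard against the file name only, ignores case, and keeps the folder structure stored inside each zip. The function name and the paths in the usage line are placeholders.

```python
import fnmatch
import zipfile
from pathlib import Path
from typing import List


def extract_matching(zip_root: str, pattern: str, dest: str) -> List[str]:
    """Extract files whose name matches `pattern` from every zip under zip_root."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in sorted(Path(zip_root).rglob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                # Compare lower-cased names so matching is case-insensitive everywhere.
                name = Path(info.filename).name.lower()
                if fnmatch.fnmatch(name, pattern.lower()):
                    zf.extract(info, dest_dir)
                    extracted.append(info.filename)
    return extracted
```

Usage would look like `extract_matching("/data/BOORU_CHARS_2024", "*sousou*frieren*stark*", "/tmp/out")`.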
Attention picker :
- head diversity with torso join (custom redraw) e621-4825390
- bust and belly variations (raw YOLO detect output) danbooru-7007859
