Diensttagebuch

Day 2306 (25 Apr 2025)

Advanced Python features

https://blog.edward-li.com/tech/advanced-python-features/
- Protocols
- Python slots to make accessing fields of a class faster
- for-else statements to avoid "if x is not None: ..." checks after a loop
- etc., TODO a lot of cool stuff
https://news.ycombinator.com/item?id=43770494
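
A minimal sketch of three of the features above (Protocols, __slots__, for-else). All class and function names here are made up for illustration and are not taken from the linked article.

```python
from typing import Iterable, Protocol


class Greeter(Protocol):
    """Structural type: anything with a matching greet() qualifies."""

    def greet(self, name: str) -> str: ...


class English:
    # No inheritance from Greeter is needed; the method shape is enough.
    def greet(self, name: str) -> str:
        return f"Hello, {name}!"


def welcome(greeter: Greeter, name: str) -> str:
    return greeter.greet(name)


class Point:
    """__slots__ replaces the per-instance __dict__ with fixed storage."""

    __slots__ = ("x", "y")

    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y


def describe_first_negative(numbers: Iterable[int]) -> str:
    """for-else: the else branch runs only if the loop never hit break."""
    for number in numbers:
        if number < 0:
            result = f"found {number}"
            break
    else:
        result = "no negatives"
    return result
```

Assigning an attribute that is not listed in __slots__ (e.g. Point(1, 2).z = 3) raises AttributeError, which is a handy side effect for catching typos.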

* did you know __init__.py is optional nowadays?

* you can do relative imports with things like "from ..other import foo"

* since 3.13 there is a @deprecated decorator that does what you think it does

* the new generics syntax also works on methods/functions: "def method[T](...)" very cool

* you can type kwargs with TypedDicts and Unpack: "def fn(**kwargs: Unpack[MyKwargs])"

* dataclasses (and pydantic) support immutable objects with: "class MyModel(BaseModel, frozen=True)" or "@dataclass(frozen=True)"

* class attributes on dataclasses, etc. can be defined with "MY_STATIC: ClassVar[int] = 42" this also supports abstract base classes (ABC)

* TypeVar supports binding to enforce subtypes: "TypeVar('T', bound=X)", and also a default since 3.13: "TypeVar('T', bound=X, default=int)"

* @overload is especially useful for get() methods to express that the return value can't be None if the default isn't None

* instead of Union[a, b] or Optional[a] you can write "a | b" or "a | None" nowadays

* with match you can use assert_never() to ensure exhaustive matching in a "case _:" block

* typing has reveal_type() which lets mypy print the type it thinks something is

* typing's "Self" allows you to more properly annotate class method return types

* the time module has functions for monotonic clocks and others, not just time()
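
A small sketch combining a few of the points above: a frozen dataclass with a ClassVar, an @overload-ed get(), and a monotonic clock for timing. The Settings class and the timed() helper are hypothetical names invented for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, ClassVar, Dict, Optional, Tuple, TypeVar, overload

T = TypeVar("T")


@dataclass(frozen=True)
class Settings:
    """Hypothetical immutable container; names are illustrative only."""

    MAX_RETRIES: ClassVar[int] = 42  # class attribute, not a dataclass field
    values: Dict[str, str] = field(default_factory=dict)

    # The overloads tell the type checker: no default -> maybe None,
    # str default -> always str.
    @overload
    def get(self, key: str) -> Optional[str]: ...

    @overload
    def get(self, key: str, default: str) -> str: ...

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self.values.get(key, default)


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start
```

time.monotonic() never goes backwards when the system clock is adjusted, which is why it is preferable to time.time() for measuring durations.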

Day 2305 (24 Apr 2025)

Ignoring stuff in flake8 pylint black etc.

Ignoring files:

# type: ignore
# flake8: noqa
# pylint: skip-file

vscode and cursor IDE settings

Autosave and format

    "files.autoSave": "onFocusChange",
    "[python]": {
        "editor.formatOnSave": true,
        // "editor.defaultFormatter": "charliermarsh.ruff",
        "editor.defaultFormatter": "ms-python.black-formatter",
        // reformat everything w/ ruff
        "editor.codeActionsOnSave": {
            "source.fixAll": "explicit",
            "source.organizeImports": "explicit"
        },
    },

For fix all etc. w/ ruff:

Adding rulers

    "editor.rulers": [
        78, 88
    ]

Day 2296 (15 Apr 2025)

Papers

[2303.16634] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment ¹
- Excellent
- Main pitch:
  - “Auto CoT evaluation”: make the LLM generate a chain-of-thoughts based on provided criteria
  - get the LLM to evaluate texts based on that CoT as a form-filling task
  - weight by the probabilities of the votes as given by the LLM
    - 1*chance-of-1 + 2*chance-of-2 + ...
- Compares own evaluators to multiple others
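
The probability weighting above can be sketched as an expected value over the rating tokens. This is only an illustration: the probabilities are invented, and dividing by their sum is an assumption made here to cope with probabilities that do not add up to exactly 1.

```python
from typing import Mapping


def expected_score(score_probs: Mapping[int, float]) -> float:
    """Probability-weighted score: sum(score * p(score)) / sum(p(score))."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * p for score, p in score_probs.items()) / total


# Hypothetical probabilities of the rating tokens "1".."5" from the LLM
probs = {1: 0.05, 2: 0.10, 3: 0.40, 4: 0.35, 5: 0.10}
score = expected_score(probs)  # a fractional score instead of a flat "3"
```

The point of the weighting is that two texts that would both get a plain "3" can still be ranked against each other.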
[2302.04166] GPTScore: Evaluate as You Desire²
- basics:
  - task: e.g. summarize
  - aspect: e.g. relevance
  - eval protocol: how likely it is that this text was generated given this task and this aspect?
- main pitch:
  - LLM-estimated likelihood of text based on task+aspect+result.
  - For summary: {Task_Specification} {Aspect_Definition} Text: {Text} Tl;dr: {Summ}
- aspects:
UniEval [2210.07197] Towards a Unified Multi-Dimensional Evaluator for Text Generation³
- one trained evaluator to rule them all, based on boolean Q/A
- “is this a coherent summary?” -> chances for yes/no -> profit
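
A tiny sketch of the yes/no idea, assuming the score is the probability of "yes" normalized over the two answers; the function name and the input numbers are made up for illustration.

```python
def boolean_qa_score(p_yes: float, p_no: float) -> float:
    """Share of the yes/no probability mass that went to 'yes'."""
    total = p_yes + p_no
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    return p_yes / total


# Hypothetical probabilities the evaluator assigns to "Yes" and "No"
coherence = boolean_qa_score(p_yes=0.6, p_no=0.2)
```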

Approaches

Main goal: coherence with human judgement

Kinds

Reference-free

LLM-estimate the probability of a generated text
- assumption being that better texts have better probability
- downsides: unreliable / low human correspondence according to ¹
Ask an LLM how coherent / grammatical / criterium_name the output is from 0 to 5
- downsides: LLMs will often pick 3 but G-Eval works around that

Reference-based

BLEU/ROUGE and friends (n-gram based)
- low correlation with human metrics
(semantic) Similarity between two texts based on embeddings (BERTScore/MoverScore)
GPTScore²: higher probability to texts better fitting the framing
FACTUAL CONSISTENCY of generated summaries: FactCC, QAGS⁴

Framings (for GPT/LLM-based metrics)

Form-filling (G-Eval)
Conditional generation (=probability): GPTScore
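
A rough sketch of the conditional-generation framing: build the prompt from the task/aspect template noted earlier, then score the summary by the log-probabilities the model assigns to its tokens. No LLM is called here; the token log-probabilities are invented inputs, and equal token weights (a plain mean) are an assumption.

```python
from typing import Sequence

# Template layout follows the summary example in the notes above.
TEMPLATE = "{task}\n{aspect}\nText: {text}\nTl;dr: "


def build_prompt(task: str, aspect: str, text: str) -> str:
    """Prefix that the summary is conditioned on."""
    return TEMPLATE.format(task=task, aspect=aspect, text=text)


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-probability of the summary tokens given the prompt."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


prompt = build_prompt(
    task="Generate a summary for the following text.",
    aspect="The summary should be relevant to the source text.",
    text="Car has 4 wheels",
)
# Hypothetical per-token log-probs of the summary, as returned by some LLM API
score = mean_logprob([-0.2, -1.1, -0.4])
```

Higher (less negative) scores mean the model finds the summary more likely under that task and aspect framing.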

How to evaluate evaluations

Meta-evaluators

Meta-evaluators are a thing! (= datasets based on human ratings)
- G-Eval¹ uses
  - SummEval ⁵
  - TopicalChat⁶
  - QAGS⁴ evaluating factual consistency/hallucinations
- ¹
Correlation with human metrics

Stats

Many papers use various statistics to compare evaluators against human ratings.
G-Eval (p. 6)
- Spearman, Kendall-Tau correlations
- Krippendorff’s Alpha

On a specific task

Task formulation

Input:
- Advert for a car.
  
  Car has 4 wheels
- Target profile of possible client
  
  family with 10 kids 5 dogs living in the Australian bush
Output:
- Advert targeted for that profile:
  
  ROBUST car with 4 EXTRA LARGE WHEELS made of AUSTRALIAN METAL able to hold 12 KIDS and AT LEAST 8 DOGS

Concrete approaches

Factual correctness / hallucinations
- Framed as summarization evaluation: FactCC, QAGS⁴
GPTScore, G-Eval based on yet-to-be-determined-criteria
- Coherence, consistency, fluency, engagement, etc. — the criteria used by meta-evaluators?
- GPTScore aspects
- TODO: ask the car company what they care about besides factual correctness and grammar
Information density-ish? Use an LLM to extract key-value pairs of the main info in the advert (number_of_wheels: 4), formulate a question for each pair, and score adverts higher when they contain answers to more of the questions!
- Inspired by QAGS
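
A toy sketch of that scoring idea. In the real thing an LLM would extract the facts and answer the questions; here both are faked with a hand-written dict and a case-insensitive substring check, purely as placeholders. All names are hypothetical.

```python
from typing import Dict, List


def to_questions(facts: Dict[str, str]) -> List[str]:
    """One naive question per extracted key, e.g. 'What is the number of wheels?'."""
    return [f"What is the {key.replace('_', ' ')}?" for key in facts]


def coverage_score(facts: Dict[str, str], advert: str) -> float:
    """Fraction of facts whose value appears in the advert.

    Substring matching is a crude stand-in for LLM question answering.
    """
    if not facts:
        return 0.0
    text = advert.lower()
    hits = sum(1 for value in facts.values() if value.lower() in text)
    return hits / len(facts)


# Hand-written stand-in for the LLM-extracted key-value pairs
facts = {"number_of_wheels": "4", "material": "metal"}
questions = to_questions(facts)
full = coverage_score(facts, "ROBUST car with 4 EXTRA LARGE WHEELS made of AUSTRALIAN METAL")
empty = coverage_score(facts, "A robust car for big families")
```

An advert that keeps more of the source facts scores closer to 1.0; one that drops them all scores 0.0.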

TODO

LLM-as-a-judge and comparative assessment
think about human eval + meta-eval
think about what we want from the car company
- metrics they care about

¹ Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu: “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment” (2023) / http://arxiv.org/abs/2303.16634 / 10.48550/arXiv.2303.16634
² Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu: “GPTScore: Evaluate as You Desire” (2023) / http://arxiv.org/abs/2302.04166 / 10.48550/arXiv.2302.04166
³ Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, Jiawei Han: “Towards a Unified Multi-Dimensional Evaluator for Text Generation” (2022) / http://arxiv.org/abs/2210.07197 / 10.48550/arXiv.2210.07197
⁴ Alex Wang, Kyunghyun Cho, Mike Lewis: “Asking and Answering Questions to Evaluate the Factual Consistency of Summaries” (2020) / http://arxiv.org/abs/2004.04228 / 10.48550/arXiv.2004.04228
⁵ Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev: “SummEval: Re-evaluating Summarization Evaluation” (2021) / https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00373/100686/SummEval-Re-evaluating-Summarization-Evaluation / 10.1162/tacl_a_00373
⁶ Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tur: “Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations” (2023) / http://arxiv.org/abs/2308.11995 / 10.48550/arXiv.2308.11995

Day 2295 (14 Apr 2025)

Latex referencing figure without caption

Tried to use subfig for two figures side by side, but couldn’t \autoref it.
I had a caption for the individual subfigs but not for the large figure itself. As soon as I added the caption it worked.

\begin{figure}%
    \centering
    \subfloat[\centering caption subfig 1]{{\includegraphics[width=0.4\linewidth]{images/fig2.png}}}%
    \qquad
    \subfloat[\centering caption subfig 2]{{\includegraphics[width=0.4\linewidth]{images/fig4.png} }}%
    \caption{Without }
    \label{fig:twosamples}%
\end{figure}

\autoref{fig:twosamples}

Day 2270 (19 Mar 2025)

Pydantic validation blues

Pydantic’s FilePath is like Path except that the file has to exist and be a file.

BUT FilePath when validating expects a string as input, not a Path! (in other words: FilePath(Path) doesn’t seem to work)

So when I create a Validator that converts str into Path ¹:

@field_validator("filename", mode="before")
@classmethod
def parse_filename(cls, value: str | Path) -> Path:
    return Path(value)

I get a wonderful

>       doc = UCFDocument.model_validate_json(json_string)
E       pydantic_core._pydantic_core.ValidationError: 1 validation error for UCFDocument
E       filename
E         Input is not a valid path for <class 'pathlib.Path'> [type=path_type, input_value=PosixPath('/home/sh/w/cor...n/doc.pdf_data/doc.pdf'), input_type=PosixPath]

tests/ucf/test_data_structures.py:179: ValidationError

Again, the error is a PosixPath not being a Path, though it is one:

E         Input is not a valid path for <class 'pathlib.Path'> [type=path_type, input_value=PosixPath('/home/sh/w/cor...n/doc.pdf_data/doc.pdf'), input_type=PosixPath]

# explicitly expecting a PosixPath creates an even better
E         Input is not a valid path for <class 'pathlib.PosixPath'> [type=path_type, input_value=PosixPath('/home/sh/w/cor...n/doc.pdf_data/doc.pdf'), input_type=PosixPath]

Not intuitive at all.

Solution is to give FilePath strings and only strings, or drop FilePath to begin with.

├── pydantic v2.10.6
│   ├── annotated-types v0.7.0
│   ├── pydantic-core v2.27.2
│   │   └── typing-extensions v4.12.2

¹ (don’t ask why I needed this, this is a minimal reproducible example only)

CVAT for image labeling

CVAT is a really neat labelling platform, online + free on-premise w/ Docker.
(Github: cvat-ai/cvat: Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.)

I like it more than Label Studio for images: it has more functions, but is also “heavier” / bulkier.

Love how it supports even 600mb 4-channel TIFF satellite images and is quite fast at that.

Bits:

Enable auto-save every N minutes
<C-a> for snapping polygons to existing polygon points
“Backup project” to re-import later, “Export project” to get e.g. YOLO annotations

Day 2268 (17 Mar 2025)

Current LLM evaluation landscape

The Open LLM Leaderboard is dead¹, as good a time as any to look for new eval stuff!

HF universe
- The Open LLM Leaderboard people are actually OpenEvals, and they created other cool stuff
  - huggingface/lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends eval suite
    - I like their documentation for new task / new model
  - huggingface/evaluation-guidebook: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
    - the contents are awesome, w/ examples, model-as-a-judge, etc.
- evaluate-metric (Evaluate Metric) recommended by them as a guide to existing metrics
Harnesses
- Evalverse: Unified and Accessible Library for Large Language Model Evaluation: a meta-thing that can run different harnesses based on the target. Github repo archived? (UpstageAI/evalverse: The Universe of Evaluation. All about the evaluation for LLMs.)
- open-compass/opencompass: OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets.
  - don’t really like their documentation on adding datasets/models
- The venerable stanford-crfm/helm: Holistic Evaluation of Language Models (HELM)
- openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Resources / articles:
- How to evaluate by Meta: Validation | How-to guides
- HF eval guidebook: huggingface/evaluation-guidebook

¹ open-llm-leaderboard/open_llm_leaderboard · It’s been a wild ride, folks :) (end of the Open LLM Leaderboard)

Day 2264 (13 Mar 2025)

Exporting gitea projects

Backup and Restore | Gitea Documentation has the full detailed story.

The easy, stupid way to back up gitea running in docker; ~~untested and~~ allegedly it will fail if the DB is in use during the dump.

docker exec -it --user git gitea-container bash

gitea dump

# then outside the container, copy from the gitea container to host OS

docker cp gitea-container:/whatever/gitea-dump.zip /tmp

Importing: the docs don’t have the correct paths, not easy to follow.

EDIT: if your docker has /data mounted somewhere local, just copying that directory somewhere might work.

Installing SSDs into M.2 slots and drive stuff

M.2 slots have keys: How do I install an M.2 SSD on my computer? - Transcend Information, Inc.
B+M means both B and M slots are acceptable for the B+M module

CLI: sudo lshw -C disk lists all disks

serhii.net