Comments / datasecrets profile

26.03.2025 14:23:48

Wed, 26 Mar 2025 14:23:48 GMT

Yes, that is sufficient, of course, but it gives no picture of which spaces actually admit such a mapping and which do not. Gromov posed his question precisely to find that out.

10.12.2024 07:22:29

Tue, 10 Dec 2024 07:22:29 GMT

We added it to the beginning of the post. We will also repeat it here: https://blog.google/technology/research/google-willow-quantum-chip/

Thanks for pointing it out)

19.11.2024 14:42:49

Tue, 19 Nov 2024 14:42:49 GMT

Yes, indeed. That said, the CEO of Anthropic and the GTM at OpenAI, when commenting on the latest news, both suggested that scaling will continue. It may simply happen not in pretraining, as we are used to, but in test-time training or reasoning. So we will see!

19.11.2024 14:37:46

Tue, 19 Nov 2024 14:37:46 GMT

In most modern models (Stable Diffusion is a good indicator here), the diffusion model has a UNet with cross-attention built in. In addition, the text encoders in such generative models are also transformers. Here, for example, is a quote from the paper SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis:

In particular, and in contrast to the original Stable Diffusion architecture, we use a heterogeneous distribution of transformer blocks within the UNet: For efficiency reasons, we omit the transformer block at the highest feature level, use 2 and 10 blocks at the lower levels, and remove the lowest level (8× downsampling) in the UNet altogether — see Tab. 1 for a comparison between the architectures of Stable Diffusion 1.x & 2.x and SDXL. We opt for a more powerful pre-trained text encoder that we use for text conditioning. Specifically, we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the penultimate text encoder outputs along the channel-axis [1]. Besides using cross-attention layers to condition the model on the text-input, we follow [30] and additionally condition the model on the pooled text embedding from the OpenCLIP model.
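
To make the quote concrete, here is a minimal sketch of the two ideas it mentions: concatenating the per-token outputs of two text encoders along the channel axis, and using the result as keys and values in a cross-attention layer whose queries come from image features. Everything in it is an assumption for illustration: the function names, the random weights, and the toy dimensions are hypothetical and do not reproduce SDXL's actual sizes or code. The pooled-embedding conditioning and the choice of the penultimate encoder layer are left out.

```python
import math
import random

def concat_channels(tokens_a, tokens_b):
    """Concatenate two per-token embedding sequences along the channel axis."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both encoders must produce the same number of tokens")
    return [a + b for a, b in zip(tokens_a, tokens_b)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def cross_attention(image_feats, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image positions query the text tokens."""
    queries = [matvec(w_q, f) for f in image_feats]
    keys = [matvec(w_k, t) for t in text_tokens]
    values = [matvec(w_v, t) for t in text_tokens]
    scale = 1.0 / math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        attended.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return attended

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
# Toy sizes chosen for readability, not the real SDXL dimensions.
n_tokens, dim_a, dim_b = 3, 2, 4
n_pixels, dim_img, dim_head = 5, 6, 4

encoder_a_out = random_matrix(n_tokens, dim_a, rng)  # stand-in for one text encoder
encoder_b_out = random_matrix(n_tokens, dim_b, rng)  # stand-in for the other
context = concat_channels(encoder_a_out, encoder_b_out)

image_feats = random_matrix(n_pixels, dim_img, rng)
w_q = random_matrix(dim_head, dim_img, rng)
w_k = random_matrix(dim_head, dim_a + dim_b, rng)
w_v = random_matrix(dim_img, dim_a + dim_b, rng)

out = cross_attention(image_feats, context, w_q, w_k, w_v)
print(len(context), len(context[0]))  # 3 6
print(len(out), len(out[0]))          # 5 6
```

The point of the sketch is only the data flow: the text side never changes the spatial shape of the image features, it just supplies the keys and values that each image position attends to.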

02.10.2024 14:25:59

Wed, 02 Oct 2024 14:25:59 GMT

Yes, you are right, of course. Every boosting method is an ensemble, but not every ensemble is boosting)) Perhaps the author deliberately placed the method in a separate category because boosting's popularity as a standalone algorithm, rather than as one kind of ensemble, deserves separate attention. The original essay says nothing about this.
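
A small sketch of the distinction, under stated assumptions: both models below combine several weak learners (one-split regression stumps), so both are ensembles, but only the second trains them sequentially on the residuals of the previous ones. The boosting shown is a plain least-squares residual-fitting variant, not AdaBoost or any particular library's implementation, and all function names and the toy data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # a split must leave points on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_bagging(xs, ys, n_models, rng):
    """Ensemble, not boosting: independent stumps on bootstrap samples, averaged."""
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

def fit_boosting(xs, ys, n_rounds, lr=0.5):
    """Ensemble and boosting: each stump is fit to the current residuals."""
    models = []
    residuals = list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - lr * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(lr * m(x) for m in models)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data: y = x^2 on a small grid.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

bagged = fit_bagging(xs, ys, n_models=10, rng=random.Random(0))
boosted_1 = fit_boosting(xs, ys, n_rounds=1)
boosted_20 = fit_boosting(xs, ys, n_rounds=20)
print(mse(bagged, xs, ys), mse(boosted_1, xs, ys), mse(boosted_20, xs, ys))
```

In the bagged model the order of the stumps does not matter; in the boosted one it does, because each stump depends on what the earlier ones left unexplained.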

12.09.2024 18:24:27

Thu, 12 Sep 2024 18:24:27 GMT

Yes, you are right. We have fixed it in the text!

14.05.2024 12:43:49

Tue, 14 May 2024 12:43:49 GMT

You can watch the demos here; they were shown at the presentation: https://x.com/estebandiba/status/1790285228981862720

14.05.2024 09:39:35

Tue, 14 May 2024 09:39:35 GMT

An excerpt from the post:

Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
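
The difference between the two designs in the excerpt can be sketched with toy stand-ins. This is purely illustrative: the `Audio` type and all four functions are hypothetical placeholders, not OpenAI's models or API. The only thing the sketch shows is where information such as tone is lost when text is the sole interface between stages.

```python
from dataclasses import dataclass

@dataclass
class Audio:
    words: str
    tone: str  # paralinguistic information, e.g. "cheerful"

def transcribe(audio: Audio) -> str:
    """Stand-in for the speech-to-text model: only the words survive."""
    return audio.words

def text_model(prompt: str) -> str:
    """Stand-in for the text-only language model."""
    return f"You said: {prompt}"

def synthesize(text: str) -> Audio:
    """Stand-in for the text-to-speech model: it never saw the input tone."""
    return Audio(words=text, tone="neutral")

def cascaded_voice_mode(audio: Audio) -> Audio:
    """Three separate models chained through plain text."""
    return synthesize(text_model(transcribe(audio)))

def end_to_end_voice_mode(audio: Audio) -> Audio:
    """Stand-in for a single model that observes the audio directly.

    Mirroring the tone is a toy behavior, chosen only to show that the
    information is still available to the model.
    """
    return Audio(words=f"You said: {audio.words}", tone=audio.tone)

greeting = Audio(words="hello", tone="cheerful")
print(cascaded_voice_mode(greeting))    # tone is lost at the transcription step
print(end_to_end_voice_mode(greeting))  # tone is still visible to the model
```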