We start from the following statements and use loading a model from a local directory as the running example.

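from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B").cuda()
inputs = tokenizer.encode(q.strip()+" ? To answer this question, we need to know", return_tensors="pt")  # q: the question string
outputs = model.generate(inputs.cuda(), max_new_tokens=100, do_sample=False, top_k=50)  # top_k is ignored while do_sample=False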

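Loading

AutoModelForSeq2SeqLM inherits from _BaseAutoModelClass, the base class of every AutoModel class, defined in transformers/models/auto/auto_factory.py. The from_pretrained method we call actually comes from this base class. We assume the model is stored locally, skip the download logic, and take kwargs and config to be None. The first thing the method obtains is the model's commit hash.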

if not isinstance(config, PretrainedConfig):
    # We make a call to the config file first (which may be absent) to get the commit hash as soon as possible
    resolved_config_file = cached_file(
        pretrained_model_name_or_path,
        CONFIG_NAME,
        _raise_exceptions_for_gated_repo=False,
        _raise_exceptions_for_missing_entries=False,
        _raise_exceptions_for_connection_errors=False,
        **hub_kwargs,
    )
    commit_hash = extract_commit_hash(resolved_config_file, commit_hash)

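The config file has to be located first, which is done through cached_file. CONFIG_NAME defaults to config.json, and pretrained_model_name_or_path is the string passed to from_pretrained. Here is the relevant part of the implementation.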

    path_or_repo_id = str(path_or_repo_id) # "bigscience/T0_3B"
    full_filename = os.path.join(subfolder, filename) # filename=config.json
    if os.path.isdir(path_or_repo_id):
        resolved_file = os.path.join(os.path.join(path_or_repo_id, subfolder), filename) # subfolder is an empty string when not specified
        if not os.path.isfile(resolved_file):
            if _raise_exceptions_for_missing_entries:
                raise EnvironmentError(
                    f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout "
                    f"'https://huggingface.co/{path_or_repo_id}/tree/{revision}' for available files."
                )
            else:
                return None
        return resolved_file # returns bigscience/T0_3B/config.json

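At this point cached_file has only resolved the path to config.json; the file has not been read yet. AutoConfig.from_pretrained then parses it into a config object. The dump below shows that it is indeed T0_3B/config.json.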

            config, kwargs = AutoConfig.from_pretrained(
                pretrained_model_name_or_path,
                return_unused_kwargs=True,
                trust_remote_code=trust_remote_code,
                code_revision=code_revision,
                _commit_hash=commit_hash,
                **hub_kwargs,
                **kwargs,
            )
---------config---------
T5Config {
  "_name_or_path": "/data2/wtf/model/bigscience/T0_3B",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "gradient_checkpointing": false,
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size": 32128
}

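Finally the model is loaded and the model instance is returned. _get_model_class looks up the model class for this config in the mapping; when several classes are registered for the same config, the architectures field of config.json decides which one is returned.

          model_class = _get_model_class(config, cls._model_mapping) # T5ForConditionalGeneration
          # config.json contains "architectures": ["T5ForConditionalGeneration"],
          # so the class named there is the one returned
          return model_class.from_pretrained(
              pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
          )

cls._model_mapping above is the mapping from config class to model class for this AutoModel.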

class AutoModelForSeq2SeqLM(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING
------
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = _LazyAutoMapping(
    CONFIG_MAPPING_NAMES, MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES
)

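CONFIG_MAPPING_NAMES is a dictionary keyed by model_type (the model_type field of config.json, not the path you pass in) whose values are config class names. MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES is a dictionary of the same shape whose values are model class names. In this example the key is "t5", which gives T5Config and T5ForConditionalGeneration, the same class named in config.json.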
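The two-step lookup can be imitated with plain dictionaries. The sketch below is a toy re-creation and not the library's `_LazyAutoMapping`: the names `TOY_CONFIG_MAPPING_NAMES`, `TOY_SEQ2SEQ_MAPPING_NAMES` and `model_class_name_for` are made up for illustration, and only two model types are listed.

```python
# Toy stand-ins for the two name tables; both are keyed by `model_type`.
TOY_CONFIG_MAPPING_NAMES = {"t5": "T5Config", "bart": "BartConfig"}
TOY_SEQ2SEQ_MAPPING_NAMES = {
    "t5": "T5ForConditionalGeneration",
    "bart": "BartForConditionalGeneration",
}


def model_class_name_for(config_class_name: str) -> str:
    """Map a config class name to a model class name via the shared model_type key."""
    for model_type, cfg_name in TOY_CONFIG_MAPPING_NAMES.items():
        if cfg_name == config_class_name:
            return TOY_SEQ2SEQ_MAPPING_NAMES[model_type]
    raise KeyError(f"no seq2seq model registered for {config_class_name}")


print(model_class_name_for("T5Config"))
```

The real mapping additionally imports the classes lazily, so a model's module is only loaded when that model is actually requested.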

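The from_pretrained in model_class.from_pretrained comes from PreTrainedModel, the base class of all models (not of the AutoModel classes). This function is the core of this article. The returned model instance is in model.eval() mode by default; to fine-tune or train it, call model.train() explicitly.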

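A quick aside on the difference between model.train() and model.eval():

Dropout and BatchNorm behavior. Under model.train(), Dropout layers randomly zero a fraction of activations and BatchNorm layers use the statistics of the current batch. Under model.eval(), Dropout is a no-op and BatchNorm uses the running averages accumulated during training. This is the only thing the two calls switch directly.
Gradients and the optimizer. Neither call touches gradient computation: model.eval() does not disable gradients and model.train() does not update parameters. Gradients are computed whenever backward() is called, parameters change only in optimizer.step(), and gradient tracking is turned off with torch.no_grad().
Data augmentation. Augmentations such as flips and rotations are usually applied to training data and skipped at evaluation time, but that is configured in the data pipeline, not by model.train() or model.eval().
Memory and compute. Training keeps intermediate activations for the backward pass, so it costs more. Inference needs only the forward pass, but the activations are freed only when it runs under torch.no_grad() (as generate does); model.eval() alone does not save that memory.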

Next, a few of the more important parameters.

@classmethod
def from_pretrained(
    cls,
    pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
    *model_args,
    config: Optional[Union[PretrainedConfig, str, os.PathLike]] = None,
    cache_dir: Optional[Union[str, os.PathLike]] = None,
    ignore_mismatched_sizes: bool = False,
    force_download: bool = False,
    local_files_only: bool = False,
    token: Optional[Union[str, bool]] = None,
    revision: str = "main",
    use_safetensors: bool = None,
    **kwargs,
):

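pretrained_model_name_or_path: the local path of the model, or its name on the Hugging Face Hub.
force_download: download the model even if it is already cached, overwriting the existing files.
torch_dtype: torch.float16, torch.bfloat16, torch.float or "auto"; sets the precision the weights are loaded in. An explicit torch.dtype is used as given. With "auto", the torch_dtype recorded in config.json takes priority; only when config.json has none does loading fall back to the dtype of the checkpoint weights (the first floating-point tensor found, or the first tensor if there is none). If the parameter is not passed at all, torch.float is used.
device_map: accepts three kinds of values. A string: a device name such as "cpu", "cuda:1" or "mps" places the whole model on that device, while "auto" lets Accelerate compute the placement. A dictionary: keys are submodule names and values are device indices or names, which lets you place parts of the model by hand; naming a module is enough, since everything under it goes to the same device. For example: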
```
device_map = {
    "transformer.encoder": "cuda:0",
    "transformer.decoder": "cuda:1",
    "transformer.pooler": "cuda:0",
    "lm_head": "cuda:1"
}
```
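An integer or a torch.device also works and places the whole model on that GPU, for example device = torch.device("cuda:1"), device_map = device. Whenever device_map is given, low_cpu_mem_usage is set to True. Without it the model is loaded on the CPU.
quantization_config: the quantization strategy, either a dictionary or an object inheriting from QuantizationConfigMixin. load_in_4bit and load_in_8bit can also be passed directly, but this is discouraged in favor of an explicit quantization_config. Quantization here applies to the stored weights, which is all that inference needs. An example follows.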

from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# 8-bit weight quantization through bitsandbytes
# (needs the bitsandbytes package and a CUDA GPU)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "bigscience/T0_3B",
    quantization_config=quantization_config,
    device_map="auto",
)

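local_files_only: if True, nothing is downloaded from the Hub.
low_cpu_mem_usage: tries to keep CPU memory use, peak included, within 1x the model size while loading.
attn_implementation: one of flash_attention_2, sdpa (the default when available) or eager (the manual implementation).

Now for a few key places in the source.

The config is extracted from the path passed in.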

# Load config if we don't provide a configuration
if not isinstance(config, PretrainedConfig):
    config_path = config if config is not None else pretrained_model_name_or_path
    config, model_kwargs = cls.config_class.from_pretrained

Quantization. Note that a quantized model turns low_cpu_mem_usage on when it was left unset.

pre_quantized = getattr(config, "quantization_config", None) is not None
if pre_quantized or quantization_config is not None:
    if pre_quantized:
        config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
            config.quantization_config, quantization_config
        )
    else:
        config.quantization_config = quantization_config
    hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config, pre_quantized=pre_quantized)
else:
    hf_quantizer = None

if hf_quantizer is not None:
    hf_quantizer.validate_environment(
        torch_dtype=torch_dtype, from_tf=from_tf, from_flax=from_flax, device_map=device_map
    )
    torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
    device_map = hf_quantizer.update_device_map(device_map)

    # Force-set to `True` for more mem efficiency
    if low_cpu_mem_usage is None:
        low_cpu_mem_usage = True
        logger.warning("`low_cpu_mem_usage` was None, now set to True since model is quantized.")
is_quantized = hf_quantizer is not None

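Loading the weights; the TensorFlow branches are skipped here. For PyTorch weights, the code looks for the weight file pytorch_model.bin in the directory you specified. If not passed, subfolder is an empty string and variant is None.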

elif os.path.isfile(
    os.path.join(pretrained_model_name_or_path, subfolder, _add_variant(WEIGHTS_NAME, variant))
):
    # Load from a PyTorch checkpoint; the path resolves to model_path/pytorch_model.bin
    archive_file = os.path.join(
        pretrained_model_name_or_path, subfolder, _add_variant(WEIGHTS_NAME, variant)
    )

Some models store their weights in several checkpoint files. In that case an index file, WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json", is required to say which file holds which parameter.

                elif os.path.isfile(
                    os.path.join(pretrained_model_name_or_path, subfolder, _add_variant(WEIGHTS_INDEX_NAME, variant))
                ):
                    # Load from a sharded PyTorch checkpoint
                    archive_file = os.path.join(
                        pretrained_model_name_or_path, subfolder, _add_variant(WEIGHTS_INDEX_NAME, variant)
                    )
                    is_sharded = True # note this flag
---- (illustrative index file; the parameter names are T5-style, the file names and size are made up)
{
    "metadata": {
        "total_size": 350000
    },
    "weight_map": {
        "shared.weight": "pytorch_model-00001-of-00002.bin",
        "encoder.block.0.layer.0.SelfAttention.q.weight": "pytorch_model-00001-of-00002.bin",
        "decoder.final_layer_norm.weight": "pytorch_model-00002-of-00002.bin",
        "lm_head.weight": "pytorch_model-00002-of-00002.bin"
    }
}
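As a rough illustration of what the index is for, the sketch below collects the shard files that have to be opened. It assumes the index holds a `weight_map` from parameter name to shard file name; the index content and the helper `shard_files` are hypothetical and not taken from the library.

```python
import json

# Hypothetical index content; parameter names, file names and size are illustrative.
index_text = """
{
  "metadata": {"total_size": 3000},
  "weight_map": {
    "shared.weight": "pytorch_model-00001-of-00002.bin",
    "encoder.final_layer_norm.weight": "pytorch_model-00001-of-00002.bin",
    "lm_head.weight": "pytorch_model-00002-of-00002.bin"
  }
}
"""


def shard_files(index: dict) -> list:
    """Return the distinct shard files named in the index, in sorted order."""
    return sorted(set(index["weight_map"].values()))


index = json.loads(index_text)
print(shard_files(index))
```

Each shard is then loaded in turn, which is why a sharded checkpoint never needs the whole state dict in memory at once.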

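There is one more case: the path is not a directory but the weight file itself, such as bigscience/T0_3B/pytorch_model.bin. That also loads, because every branch ends up setting archive_file to the weight file.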

elif os.path.isfile(os.path.join(subfolder, pretrained_model_name_or_path)):
    archive_file = pretrained_model_name_or_path
    is_local = True

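In the end resolved_archive_file = archive_file gives the path of the weight file. For a sharded checkpoint, where is_sharded is True, some extra steps are needed that we will not go into.

The weights are loaded next. The code first checks whether this is a PyTorch checkpoint and, if so, loads the weight file. Skipping the details, the result is a state_dict loaded with torch.load: a mapping from parameter names to tensors. It holds weights only; the model structure comes from instantiating the class with the config.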

if from_pt:
    if not is_sharded and state_dict is None:
        # Time to load the checkpoint
        state_dict = load_state_dict(resolved_archive_file)

Next the dtype of the weights is decided. As described for the torch_dtype parameter above, torch_dtype="auto" is handled first and takes the dtype recorded in config.json.

if torch_dtype is not None:
    if isinstance(torch_dtype, str):
        if torch_dtype == "auto":
            if hasattr(config, "torch_dtype") and config.torch_dtype is not None:
                torch_dtype = config.torch_dtype
                logger.info(f"Will use torch_dtype={torch_dtype} as defined in model's config object")

        else:
            raise ValueError(
                f'`torch_dtype` can be either `torch.dtype` or `"auto"`, but received {torch_dtype}'
            )
    dtype_orig = cls._set_default_torch_dtype(torch_dtype)

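When config.json records no dtype, a sharded checkpoint is checked for a dtype entry in its index metadata; failing that, the first shard is loaded and inspected. For a non-sharded checkpoint the dtype comes from the tensors in the weight file. If torch_dtype is not passed at all (None), float32 is used.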

                    else:
                        if is_sharded and "dtype" in sharded_metadata:
                            torch_dtype = sharded_metadata["dtype"]
                        elif not is_sharded:
                            torch_dtype = get_state_dict_dtype(state_dict)
                        else:
                            one_state_dict = load_state_dict(resolved_archive_file[0])
                            torch_dtype = get_state_dict_dtype(one_state_dict)
                            del one_state_dict  # free CPU memory
                        logger.info(
                            "Since the `torch_dtype` attribute can't be found in model's config object, "
                            "will use torch_dtype={torch_dtype} as derived from model's weights"
                        )    
-------------get_state_dict_dtype(state_dict)---------
# if no floating dtype was found return whatever the first dtype is
else:
    return next(state_dict.values()).dtype
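The fallback order can be condensed into a small function. This is a simplified sketch, not the library code: dtypes are represented as plain strings so that it runs without torch, and `resolve_dtype` and its parameter names are made up for illustration.

```python
from typing import Optional


def resolve_dtype(requested: Optional[str], config_dtype: Optional[str], weights_dtype: str) -> str:
    """Pick the load dtype; dtypes are plain strings here to avoid importing torch."""
    if requested is None:
        return "float32"  # nothing requested: the framework default
    if requested == "auto":
        # Prefer the dtype recorded in config.json, else fall back to the checkpoint.
        return config_dtype if config_dtype is not None else weights_dtype
    return requested  # an explicit dtype is used as given


print(resolve_dtype("auto", None, "float16"))
```

Note that the config dtype only matters under "auto": leaving the argument out loads in float32 even if config.json records bfloat16.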

Mixed precision is also possible: a model class can set _keep_in_fp32_modules to name the modules that stay in fp32 while the rest is loaded in fp16.

# Check if `_keep_in_fp32_modules` is not None
use_keep_in_fp32_modules = (cls._keep_in_fp32_modules is not None) and (
    (torch_dtype == torch.float16) or hasattr(hf_quantizer, "use_keep_in_fp32_modules")
)

Create the model instance.

with ContextManagers(init_contexts):
    # Let's make sure we don't run the init function of buffer modules
    model = cls(config, *model_args, **model_kwargs)

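Now the device_map logic. Plain device names have already been turned into a dictionary by this point (see the branch further below), so a string that remains must be one of "auto", "balanced", "balanced_low_0" or "sequential", otherwise an error is raised. "auto" picks a placement automatically, spreading the load as evenly as it can. balanced splits the model's parameters evenly across the cards. balanced_low_0 gives GPU 0 a smaller share, since GPU 0 usually has other work to do. sequential fills the cards in order, GPU 0 first and the next card only once the previous one is full, so consecutive layers stay together and cross-device transfers are reduced.

if device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
    raise ValueError
if device_map != "sequential":
    max_memory = get_balanced_memory(
        model,
        dtype=target_dtype,
        low_zero=(device_map == "balanced_low_0"),
        max_memory=max_memory,
        **device_map_kwargs,
    )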

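As the code shows, auto currently follows the same path as balanced.

In every other case the model is bound to whatever device was passed in; if device_map is not given, the model is loaded on the CPU.

if isinstance(device_map, torch.device):
    device_map = {"": device_map}
elif isinstance(device_map, str) and device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
    try:
        device_map = {"": torch.device(device_map)}
    except RuntimeError:
        raise ValueError  # not a valid device name (message elided)
elif isinstance(device_map, int):
    if device_map < 0:
        raise ValueError  # negative indices are rejected (message elided)
    device_map = {"": device_map}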
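The normalization can be tried out without torch. In the sketch below devices are plain strings or integers, and `normalize_device_map` and `BALANCING_STRATEGIES` are hypothetical names; the empty-string key is the convention for "the whole model".

```python
BALANCING_STRATEGIES = ("auto", "balanced", "balanced_low_0", "sequential")


def normalize_device_map(device_map):
    """Turn a single-device argument into a {"": device} dict; leave the rest alone."""
    if device_map is None or isinstance(device_map, dict):
        return device_map
    if isinstance(device_map, int):
        if device_map < 0:
            raise ValueError("device index must be non-negative")
        return {"": device_map}
    if isinstance(device_map, str):
        if device_map in BALANCING_STRATEGIES:
            return device_map  # resolved later by the balancing logic
        return {"": device_map}  # e.g. "cpu" or "cuda:1"
    raise TypeError(f"unsupported device_map: {device_map!r}")


print(normalize_device_map("cuda:1"))
print(normalize_device_map("auto"))
```

Unlike the real code, this sketch does not validate that a string names an actual device.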

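Next comes the tie_weights function, which ties parameters together. By default it makes the input embedding layer and the output embedding layer share one weight tensor, unless tie_word_embeddings is false in the config (as it is in the T0_3B config.json shown earlier). If the config also sets is_encoder_decoder=True and tie_encoder_decoder=True, the encoder and the decoder share parameters as well. T5 does not do this; it is typically used when BERT-style models are fine-tuned into an encoder-decoder model.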

def tie_weights(self):
    """
    Tie the weights between the input embeddings and the output embeddings.

    If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the
    weights instead.
    """
    if getattr(self.config, "tie_word_embeddings", True):
        output_embeddings = self.get_output_embeddings()
        if output_embeddings is not None:
            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())

    if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False):
        if hasattr(self, self.base_model_prefix):
            self = getattr(self, self.base_model_prefix)
        self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)

    for module in self.modules():
        if hasattr(module, "_tie_weights"):
            module._tie_weights()

Switch to eval mode.

model.eval()

If the model can generate, its generation parameters are set up as well. GenerationConfig holds the parameters that model.generate() uses.

if model.can_generate() and pretrained_model_name_or_path is not None:
    try:
        model.generation_config = GenerationConfig.from_pretrained(
            pretrained_model_name_or_path,
            cache_dir=cache_dir,
            force_download=force_download,
            resume_download=resume_download,
            proxies=proxies,
            local_files_only=local_files_only,
            token=token,
            revision=revision,
            subfolder=subfolder,
            _from_auto=from_auto_class,
            _from_pipeline=from_pipeline,
            **kwargs,
        )
    except OSError:
        logger.info(
            "Generation config file not found, using a generation config created from the model config."
        )
        pass

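Finally, the loading information is returned along with the model if it was requested; otherwise just the model is returned.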

if output_loading_info:
    if loading_info is None:
        loading_info = {
            "missing_keys": missing_keys,
            "unexpected_keys": unexpected_keys,
            "mismatched_keys": mismatched_keys,
            "error_msgs": error_msgs,
        }
    return model, loading_info

To summarize, the most common call does the following.

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B",device_map="auto")

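Find config.json under the given path and load it into a Config object.
Find the weight file pytorch_model.bin under the given path and load the weights with torch.load.
Decide the dtype of the weights; float32 if unspecified.
Distribute the parameters evenly across the cards.
Tie the input and output embeddings so they share one set of parameters, unless tie_word_embeddings is false.
Call model.eval().
If the model can generate, set up the generation config.
Return the model instance.

So two different from_pretrained methods are called. The first belongs to _BaseAutoModelClass, the base class of the AutoModel classes; at its end it calls _get_model_class to obtain the concrete model class, T5ForConditionalGeneration in this example, and then calls that class's from_pretrained. That method is implemented in the base class PreTrainedModel, so PreTrainedModel.from_pretrained is what runs next. That concludes the analysis.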

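Tokenization

To be added in a later update.

Generation

Not every model can generate sequences with .generate(); can_generate decides whether a model supports it, which is why every model that does must override prepare_inputs_for_generation.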

class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMixin, PeftAdapterMixin):
    @classmethod
    def can_generate(cls) -> bool:
        """
        Returns whether this model can generate sequences with `.generate()`.

        Returns:
            `bool`: Whether this model can generate sequences with `.generate()`.
        """
        # Detects whether `prepare_inputs_for_generation` has been overwritten, which is a requirement for generation.
        # Alternativelly, the model can also have a custom `generate` function.
        if "GenerationMixin" in str(cls.prepare_inputs_for_generation) and "GenerationMixin" in str(cls.generate):
            return False
        return True
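The string test works because a function's repr contains the name of the class that defined it. Below is a minimal sketch of the same trick with toy classes; `ToyPreTrainedModel`, `ToyEncoderOnly` and `ToySeq2Seq` are made up for illustration, and the `GenerationMixin` here is a bare stand-in for the real one.

```python
class GenerationMixin:
    """Bare stand-in for the real mixin: default methods that models may override."""

    def prepare_inputs_for_generation(self, *args, **kwargs):
        raise NotImplementedError

    def generate(self, *args, **kwargs):
        return "generated"


class ToyPreTrainedModel(GenerationMixin):
    @classmethod
    def can_generate(cls) -> bool:
        # An inherited method still reports "GenerationMixin" in its repr;
        # an overridden one reports the subclass name instead.
        if "GenerationMixin" in str(cls.prepare_inputs_for_generation) and "GenerationMixin" in str(cls.generate):
            return False
        return True


class ToyEncoderOnly(ToyPreTrainedModel):
    pass  # overrides nothing


class ToySeq2Seq(ToyPreTrainedModel):
    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {"input_ids": input_ids}


print(ToyEncoderOnly.can_generate(), ToySeq2Seq.can_generate())
```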

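The generate method itself is implemented in the GenerationMixin class. The decoding strategies it supports are listed in the GenerationConfig docstring: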

Class that holds a configuration for a generation task. A `generate` call supports the following generation methods
for text-decoder, text-to-text, speech-to-text, and vision-to-text models:

    - *greedy decoding* by calling [`~generation.GenerationMixin._greedy_search`] if `num_beams=1` and
        `do_sample=False`
    - *contrastive search* by calling [`~generation.GenerationMixin._contrastive_search`] if `penalty_alpha>0.`
        and `top_k>1`
    - *multinomial sampling* by calling [`~generation.GenerationMixin._sample`] if `num_beams=1` and
        `do_sample=True`
    - *beam-search decoding* by calling [`~generation.GenerationMixin._beam_search`] if `num_beams>1` and
        `do_sample=False`
    - *beam-search multinomial sampling* by calling [`~generation.GenerationMixin._beam_sample`] if
        `num_beams>1` and `do_sample=True`
    - *diverse beam-search decoding* by calling [`~generation.GenerationMixin._group_beam_search`], if
        `num_beams>1` and `num_beam_groups>1`
    - *constrained beam-search decoding* by calling [`~generation.GenerationMixin._constrained_beam_search`], if
        `constraints!=None` or `force_words_ids!=None`
    - *assisted decoding* by calling [`~generation.GenerationMixin._assisted_decoding`], if
        `assistant_model` or `prompt_lookup_num_tokens` is passed to `.generate()`

You do not need to call any of the above methods directly. Pass custom parameter values to '.generate()'. To learn
more about decoding strategies refer to the [text generation strategies guide](../generation_strategies).

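Rather than reading each implementation, we look at the decoding strategies that GenerationConfig selects. Without a GenerationConfig, the same settings can be passed directly as kwargs.

greedy decoding: num_beams=1 and do_sample=False. Always picks the highest-probability token, never reconsidering.
Contrastive search: proposed at NeurIPS 2022; it encourages diverse output and reduces repetition while keeping the text fluent. Requires penalty_alpha>0 and top_k>1. A candidate token that is very similar to the existing context (a high similarity score) has its score lowered more, which discourages the same kind of token from piling up. The algorithm then takes the top-1 of the adjusted scores. The core formula is scores = (1.0 - alpha) * next_top_k_probs - alpha * scores. next_top_k_probs holds the model probabilities of the top-k candidates, and the scores on the right-hand side is the degeneration penalty: the highest cosine similarity between a candidate's hidden state and the hidden states of the previous tokens. The more a candidate resembles what is already there, the larger the penalty.
multinomial sampling: num_beams=1 and do_sample=True. Unlike greedy decoding it does not necessarily pick the highest-probability token; it samples from the probability distribution.
beam-search: num_beams>1 and do_sample=False. Keeps the num_beams highest-scoring candidate sequences, called beams. With sampling off, each step takes the 2*num_beams highest-scoring tokens as candidates.
diverse beam-search: num_beams>1 and num_beam_groups>1. Splits the beams into groups to enforce differences between groups.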
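The selection rules can be written as one function. This is a simplified sketch based only on the conditions listed in the docstring; the function `pick_generation_mode`, its parameter names and the order in which the checks are made are assumptions, and the library's own logic handles more cases.

```python
def pick_generation_mode(num_beams=1, do_sample=False, num_beam_groups=1,
                         penalty_alpha=None, top_k=None,
                         constrained=False, assisted=False):
    """Return the name of the decoding strategy implied by the flags."""
    if assisted:
        return "assisted decoding"
    if constrained:
        return "constrained beam search"
    if num_beams == 1:
        if do_sample:
            return "multinomial sampling"
        if penalty_alpha is not None and penalty_alpha > 0 and top_k is not None and top_k > 1:
            return "contrastive search"
        return "greedy decoding"
    if num_beam_groups > 1:
        return "diverse beam search"
    if do_sample:
        return "beam-search multinomial sampling"
    return "beam search"


# The call at the top of the article: do_sample=False, top_k=50, no penalty_alpha.
print(pick_generation_mode(do_sample=False, top_k=50))
```

With these flags the example call from the beginning of the article ends up in greedy decoding, since top_k alone does not trigger contrastive search.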

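Next, some of the more commonly used parameters.

do_sample: whether to sample from the probability distribution.
temperature: defaults to 1.0. With temperature < 1.0 the distribution is sharpened (the peak gets steeper), so the model leans toward high-probability tokens and the text becomes more focused and conservative. With temperature > 1.0 the distribution is flattened (the peak gets lower), so lower-probability tokens are chosen more often and the text becomes more varied and exploratory.
top_k: when choosing the next token, keep only the top_k most probable tokens, which keeps very unlikely tokens from being picked.
top_p: dynamically keep the smallest set of tokens whose cumulative probability reaches top_p.
num_return_sequences: the number of independent sequences to return. Defaults to 1.
output_scores: whether to return the prediction scores of each token.
output_logits: whether to return the raw, unprocessed prediction logits.
pad_token_id, bos_token_id, eos_token_id: depend on the model's vocabulary. None if not set.
max_length: maximum output length, counting the prompt and the newly generated tokens. Defaults to 20.
max_new_tokens: maximum number of newly generated tokens, not counting the prompt.
min_length: minimum output length, counting the prompt and the newly generated tokens.
min_new_tokens: minimum number of newly generated tokens, not counting the prompt.
early_stopping: controls when beam search stops. Possible values:
- True: stop as soon as there are num_beams complete candidates.
- False: apply a heuristic and stop when better candidates are very unlikely to be found.
- "never": run until it is impossible to find better candidates.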
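The effect of temperature, top_k and top_p is easy to see on a toy distribution. The sketch below uses only the standard library; the helper names and the example logits are made up, and the filters are written for clarity, not to mirror the library's logits processors.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids, renormalized."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]
base = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # lower temperature: steeper peak
flat = softmax(logits, temperature=2.0)   # higher temperature: flatter peak
print(max(sharp), max(base), max(flat))
print(sorted(top_k_filter(base, 2)), sorted(top_p_filter(base, 0.9)))
```

Sampling then draws from whatever survives the filters, so top_k fixes the size of the candidate set while top_p lets it grow or shrink with the model's confidence.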

Next, what the generate function mainly does.

@torch.no_grad()
def generate(
    self,
    inputs: Optional[torch.Tensor] = None,
    generation_config: Optional[GenerationConfig] = None,
    logits_processor: Optional[LogitsProcessorList] = None,
    stopping_criteria: Optional[StoppingCriteriaList] = None,
    prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
    synced_gpus: Optional[bool] = None,
    assistant_model: Optional["PreTrainedModel"] = None,
    streamer: Optional["BaseStreamer"] = None,
    negative_prompt_ids: Optional[torch.Tensor] = None,
    negative_prompt_attention_mask: Optional[torch.Tensor] = None,
    **kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:

inputs: usually the input_ids tensor produced by the tokenizer. Calling the tokenizer directly also returns an attention_mask, which is passed to generate as a keyword argument; tokenizer.encode() returns only the ids, so there is no attention_mask.

# 1. Handle `generation_config` and kwargs that might update it, and validate the `.generate()` call
# (omitted)

# 2. Set generation parameters if not already defined
# (omitted: fills in the defaults needed for generation)
if generation_config.pad_token_id is None and generation_config.eos_token_id is not None:
    if model_kwargs.get("attention_mask", None) is None:
        logger.warning(
            "The attention mask and the pad token id were not set. As a consequence, you may observe "
            "unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results."
        )
    eos_token_id = generation_config.eos_token_id
    # eos_token_id may be a list, since some models define more than one eos token
    if isinstance(eos_token_id, list):
        eos_token_id = eos_token_id[0]
    logger.warning(f"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation.")
    generation_config.pad_token_id = eos_token_id

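This block states that if you did not pass pad_token_id, eos_token_id is used in its place, and that if you did not pass attention_mask you are warned to pass one. Positions where the attention mask is 0 have a very large negative value (for example -1e9) added to their attention scores before the softmax, so their attention weights come out as effectively zero.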
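To make the masking concrete, here is a minimal sketch of a masked softmax over one row of attention scores. The function `masked_attention_weights` and the value -1e9 are illustrative assumptions; real implementations typically use the most negative value of the tensor's dtype.

```python
import math


def masked_attention_weights(scores, attention_mask, neg=-1e9):
    """Softmax over one row of attention scores, ignoring positions whose mask is 0."""
    masked = [s if m == 1 else s + neg for s, m in zip(scores, attention_mask)]
    top = max(masked)
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# The third position is padding, so it should receive no attention.
weights = masked_attention_weights([1.0, 2.0, 3.0], [1, 1, 0])
print(weights)
```

Without the mask the padded position would have received the largest weight here, which is the kind of unexpected behavior the warning refers to.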

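Next the model inputs are processed: the input tensor and batch_size are obtained. _prepare_model_inputs works out which argument is the model's main input and removes it from model_kwargs. For text generation models, inputs_embeds is accepted only if the model supports being fed embeddings directly; otherwise the main input is always input_ids.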

# 3. Define model inputs
# inputs_tensor has to be defined
# model_input_name is defined if model-specific keyword input is passed
# otherwise model_input_name is None
# all model-specific keyword inputs are removed from `model_kwargs`
inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(
    inputs, generation_config.bos_token_id, model_kwargs
)
batch_size = inputs_tensor.shape[0]

Decoder-only models should be left-padded. Right padding triggers a warning; set padding_side='left' when initializing the tokenizer to get correct generation results. The rest of the logic is skipped, as it only processes generation parameters such as max_length. Different decoding strategies run different generation functions; we look at part of the code for the simplest one, greedy decoding.

        if generation_mode == GenerationMode.GREEDY_SEARCH:
            # 11. run greedy search
            result = self._greedy_search(
                input_ids,
                logits_processor=prepared_logits_processor, # logits processors; min_length acts here by suppressing eos until it is reached
                stopping_criteria=prepared_stopping_criteria, # stopping criteria; max_length acts here
                pad_token_id=generation_config.pad_token_id,
                eos_token_id=generation_config.eos_token_id,
                output_attentions=generation_config.output_attentions, # whether to return the attention weights
                output_hidden_states=generation_config.output_hidden_states, # whether to return the hidden states
                output_scores=generation_config.output_scores,
                output_logits=generation_config.output_logits,
                return_dict_in_generate=generation_config.return_dict_in_generate,
                synced_gpus=synced_gpus,
                streamer=streamer,
                **model_kwargs,
            )
-----------------------------------------------------------------------------------

Each step runs the model on the whole sequence but keeps only the logits of the last position, which predict the next token. For a sequence that has already finished, next_tokens is always pad_id. input_ids is extended after every step, and a sequence counts as finished once it has produced eos_id.

while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    outputs = self(
        **model_inputs,
        return_dict=True,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )
    next_token_logits = outputs.logits[:, -1, :] # last time step
    next_tokens_scores = logits_processor(input_ids, next_token_logits)
    next_tokens = torch.argmax(next_tokens_scores, dim=-1) # index of the maximum: the greedy choice
    # finished sentences should have their next token be a padding token
    if eos_token_id is not None:
        if pad_token_id is None:
            raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
        # unfinished_sequences is initialized as torch.ones(batch_size, dtype=torch.long)
        next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
    # update generated ids, model inputs, and length for next step
    input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
    # if eos_token was found in one sentence, set sentence to finished
    if eos_token_id_tensor is not None:
        unfinished_sequences = unfinished_sequences.mul(
            next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
        )
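The same loop can be followed on a toy example without torch. In this sketch `toy_logits` is a hypothetical stand-in for the model (it simply favors the token after the last one), and `greedy_decode` keeps the same unfinished-sequence bookkeeping as the excerpt, using Python lists in place of tensors.

```python
def toy_logits(sequence, vocab_size=5):
    """Hypothetical stand-in for a model: favors token (last + 1) % vocab_size."""
    favored = (sequence[-1] + 1) % vocab_size
    return [1.0 if t == favored else 0.0 for t in range(vocab_size)]


def greedy_decode(batch, eos_token_id, pad_token_id, max_new_tokens):
    """Greedy decoding over a batch, padding sequences that have already finished."""
    sequences = [list(seq) for seq in batch]
    unfinished = [1] * len(sequences)  # 1 = still generating, 0 = finished
    for _ in range(max_new_tokens):
        for i, seq in enumerate(sequences):
            logits = toy_logits(seq)
            next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
            # Finished sequences receive padding instead of the argmax token.
            next_token = next_token * unfinished[i] + pad_token_id * (1 - unfinished[i])
            seq.append(next_token)
            if next_token == eos_token_id:
                unfinished[i] = 0
        if not any(unfinished):
            break
    return sequences


result = greedy_decode([[1], [3]], eos_token_id=4, pad_token_id=0, max_new_tokens=5)
print(result)
```

The second sequence reaches eos first and is padded from then on, which keeps the batch rectangular until the first sequence finishes too.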

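generate returns a Union[GenerateOutput, torch.LongTensor]. By default it is the latter, a tensor of token ids; a GenerateOutput is returned only when return_dict_in_generate=True.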

GenerateNonBeamOutput = Union[GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput]
GenerateBeamOutput = Union[GenerateBeamDecoderOnlyOutput, GenerateBeamEncoderDecoderOutput]
GenerateOutput = Union[GenerateNonBeamOutput, GenerateBeamOutput]

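Take GenerateBeamDecoderOnlyOutput as an example.

sequences: torch.LongTensor = None # the generated sequences; they still need decoding, and usually this is the only field used
sequences_scores: Optional[torch.FloatTensor] = None # final beam-search score of each sequence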
scores: Optional[Tuple[torch.FloatTensor]] = None
logits: Optional[Tuple[torch.FloatTensor]] = None
beam_indices: Optional[torch.LongTensor] = None
attentions: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
hidden_states: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
past_key_values: Optional[Tuple[Tuple[Tuple[torch.FloatTensor]]]] = None