One-hot encoding

Posted on 2024-04-29 Edited on 2024-07-01 In nlp

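One-hot encoding converts a categorical variable into a binary vector representation. It is widely used in machine learning and data analysis, especially for categorical features. Each possible value of the variable is represented as a binary vector whose length equals the number of distinct values; exactly one element is 1 and all the others are 0. The position of the 1 identifies the actual value, and the remaining positions correspond to the other possible values.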

For example, suppose a categorical variable "color" can take the values "red", "blue", and "green". One-hot encoding maps these values to three binary vectors:

red: [1, 0, 0]
blue: [0, 1, 0]
green: [0, 0, 1]

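In this way, the single categorical variable is replaced by a set of binary vectors, one per value. The benefit is that machine learning algorithms can handle the variable more sensibly: unlike integer labels (for example red = 0, blue = 1, green = 2), the vectors imply no ordering, and every pair of distinct values is equally far apart.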
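The color example can be reproduced in a few lines of plain Python. This is only a minimal sketch: the helper names `one_hot` and `squared_distance` are made up for illustration, and the category order is an arbitrary choice that fixes which position stands for which value.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# The order of this list decides which position represents which color.
colors = ["red", "blue", "green"]
encoded = {color: one_hot(color, colors) for color in colors}

print(encoded["red"])  # [1, 0, 0]
print(squared_distance(encoded["red"], encoded["green"]))  # 2
```

Every pair of distinct colors ends up at the same distance, which is what "no implied order" means in practice.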

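In practice, a programming language or library can do the encoding, for example the OneHotEncoder class in Python's sklearn library. Such tools convert categorical variables to their one-hot representation automatically, ready for downstream machine learning models.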
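As a rough illustration of the sklearn route, the sketch below encodes a small made-up column of colors. The input data is invented for the example, and the encoder is left at its defaults, so the result is a sparse matrix that is converted with `.toarray()` for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One row per observation, one column per categorical feature.
colors = np.array([["red"], ["blue"], ["green"], ["blue"]])

encoder = OneHotEncoder()  # default output is a sparse matrix
encoded = encoder.fit_transform(colors).toarray()

# Columns follow the sorted category order stored in categories_,
# so here the order is blue, green, red rather than red, blue, green.
print(encoder.categories_[0])
print(encoded)
```

Because the columns follow the sorted categories, "red" lands in the last column here, not the first as in the hand-written example above.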

# Mapping from bases to their one-hot codes
base_to_hotcode = {
    'A': [1, 0, 0, 0],
    'T': [0, 1, 0, 0],
    'C': [0, 0, 1, 0],
    'G': [0, 0, 0, 1]
}

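# Open the VCF file
with open('your_file.vcf', 'r') as file:
    for line in file:
        if line.startswith('#'):  # skip header and comment lines
            continue
        fields = line.strip().split('\t')
        ref = fields[3]  # reference allele (REF column)
        alt = fields[4]  # alternate allele (ALT column)
        # In a standard VCF, column 9 (index 8) is FORMAT and per-sample data
        # starts at column 10 (index 9); GT is the first subfield when present.
        genotype = fields[9].split(':')[0]  # genotype of the first sample
        if ref not in base_to_hotcode or alt not in base_to_hotcode:
            continue  # skip indels and multi-allelic sites
        if genotype == '0/0':
            hotcode = base_to_hotcode[ref]
        elif genotype == '1/1':
            hotcode = base_to_hotcode[alt]
        elif genotype == '0/1':
            # Heterozygous: add the two one-hot vectors element-wise
            hotcode = [x + y for x, y in zip(base_to_hotcode[ref], base_to_hotcode[alt])]
        else:
            continue  # skip any other genotype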

        # Further processing can go here as needed, e.g. saving the one-hot code to a file
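The encoding rule used in the loop can also be pulled out into a small function, which makes it easier to check on its own. The sketch below is one possible way to do that: `encode_genotype` is a hypothetical helper name, and returning `None` for anything it cannot encode (and treating '1/0' like '0/1') are assumptions made for this example.

```python
BASE_TO_HOTCODE = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "G": [0, 0, 0, 1],
}


def encode_genotype(ref, alt, genotype):
    """Encode one biallelic SNP genotype with the scheme used in the script.

    Homozygous calls give the one-hot code of the allele; heterozygous calls
    give the element-wise sum of both codes. Returns None otherwise.
    """
    if ref not in BASE_TO_HOTCODE or alt not in BASE_TO_HOTCODE:
        return None  # indel, multi-allelic site, or unknown base
    if genotype == "0/0":
        return list(BASE_TO_HOTCODE[ref])
    if genotype == "1/1":
        return list(BASE_TO_HOTCODE[alt])
    if genotype in ("0/1", "1/0"):  # assumption: allele order is ignored
        return [x + y for x, y in zip(BASE_TO_HOTCODE[ref], BASE_TO_HOTCODE[alt])]
    return None  # missing or unsupported call such as './.'


print(encode_genotype("A", "G", "0/1"))  # [1, 0, 0, 1]
```

Note that a heterozygous site has two positions set to 1, so strictly speaking the result is a multi-hot vector rather than a one-hot one.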

import pandas as pd
import numpy as np

# Mapping from bases to their one-hot codes
base_to_hotcode = {
    'A': [1, 0, 0, 0],
    'T': [0, 1, 0, 0],
    'C': [0, 0, 1, 0],
    'G': [0, 0, 0, 1]
}

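# Read the VCF file
with open('your_file.vcf', 'r') as file:
    lines = file.readlines()

sample_ids = []
rows = []  # one row per (sample, site); the DataFrame is built once at the end

# Walk through every line
for line in lines:
    if line.startswith('##'):
        continue  # skip meta-information lines
    fields = line.strip().split('\t')
    if line.startswith('#'):
        # The '#CHROM' header line lists the sample IDs from column 10 onward
        sample_ids = fields[9:]
        continue
    try:
        ref = fields[3]  # reference allele (REF column)
        alt = fields[4]  # alternate allele (ALT column)
        if ref not in base_to_hotcode or alt not in base_to_hotcode:
            continue  # skip indels and multi-allelic sites
        # Column 9 (index 8) is FORMAT; each later column holds one sample,
        # with GT as the first ':'-separated subfield when present
        for sample_id, sample_field in zip(sample_ids, fields[9:]):
            genotype = sample_field.split(':')[0]
            if genotype == '0/0':
                hotcode = base_to_hotcode[ref]
            elif genotype == '1/1':
                hotcode = base_to_hotcode[alt]
            elif genotype == '0/1':
                hotcode = [x + y for x, y in zip(base_to_hotcode[ref], base_to_hotcode[alt])]
            else:
                continue  # skip any other genotype
            rows.append([sample_id] + hotcode)
    except IndexError:
        continue  # skip malformed lines

# Combine the sample IDs with the one-hot codes
# (DataFrame.append was removed in pandas 2.0, so rows are collected in a list)
result = pd.DataFrame(rows, columns=['Sample_ID', 'A', 'T', 'C', 'G'])

# Write the result to a file
result.to_csv('output.txt', sep='\t', index=False)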

## Created by lx
import time
import torch
import torch.nn.functional as F
import numpy as np
 
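# Map each genotype to a class index
mapping = {'0/0': 0, '0/1': 1, '1/1': 2}
 
# Genotype sequences: each one is a list of genotype calls, so iterating
# yields whole genotypes such as '0/1' rather than single characters
sequences = [['0/0', '0/1', '1/1'] * 250] * 207  # 750 genotypes per sequence, 207 sequences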
 
# Method 1: torch.nn.functional.one_hot
start_time = time.time()
 
onehot_sequences1 = []
for sequence in sequences:
    index_sequence = [mapping[base] for base in sequence]
    onehot_sequence = F.one_hot(torch.tensor(index_sequence), num_classes=len(mapping)).float()  # one class per genotype in mapping
    onehot_sequences1.append(onehot_sequence)
 
end_time = time.time()
method1_time = end_time - start_time
 
# Method 2: index into torch.eye
start_time = time.time()
 
onehot_matrix = torch.eye(len(mapping))  # one row per genotype class
onehot_sequences2 = []
for sequence in sequences:
    index_sequence = [mapping[base] for base in sequence]
    onehot_sequence = onehot_matrix[index_sequence]
    onehot_sequences2.append(onehot_sequence)
 
end_time = time.time()
method2_time = end_time - start_time
 
# Method 3: convert with numpy, then hand over to torch
start_time = time.time()
 
onehot_matrix = np.eye(len(mapping))  # one row per genotype class
onehot_sequences3 = []
for sequence in sequences:
    index_sequence = [mapping[base] for base in sequence]
    onehot_sequence = onehot_matrix[index_sequence]
    onehot_sequences3.append(onehot_sequence)
onehot_sequences3 = torch.from_numpy(np.array(onehot_sequences3)).float()
end_time = time.time()
method3_time = end_time - start_time
 
print("Method 1 time:", method1_time)
print("Method 2 time:", method2_time)
print("Method 3 time:", method3_time)
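All three methods index one sequence at a time inside a Python loop. When every sequence has the same length, the identity-matrix trick from methods 2 and 3 can also be applied to the whole table in a single indexing step. The sketch below shows the idea on a tiny made-up input; it has not been timed against the methods above, so no speed claim is made here.

```python
import numpy as np

mapping = {"0/0": 0, "0/1": 1, "1/1": 2}

# Hypothetical toy input: 2 samples with 4 genotype calls each.
genotypes = [
    ["0/0", "0/1", "1/1", "0/0"],
    ["1/1", "1/1", "0/1", "0/0"],
]

# Convert the whole table to integer class indices once ...
indices = np.array([[mapping[g] for g in sample] for sample in genotypes])

# ... then index the identity matrix with the full 2-D index array.
onehot = np.eye(len(mapping), dtype=np.float32)[indices]

print(onehot.shape)  # (2, 4, 3)
```

The result has shape (samples, sites, classes) and can be passed to `torch.from_numpy` in the same way as in method 3.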