From CSV to Protocol Buffers with Python

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data. Compared with XML they are smaller, faster and simpler; put another way, they are a way of encoding structured data in an efficient yet extensible format. Protocol buffers are supported in many programming languages, such as Java, Python, Objective-C and C++, and proto3 adds support for JavaNano, Ruby, Go and C#.

Protocol buffers have many advantages over XML for serializing structured data.

  • protocol buffers are simpler.
  • protocol buffers are 3 to 10 times smaller.
  • protocol buffers are 20 - 100 times faster.
  • protocol buffers are less ambiguous.
  • generate data access classes that are easier to use programmatically.
No  Advantages
1   Schemas are awesome
2   Backward compatibility
3   Less boilerplate code
4   Validations and extensibility
5   Easy language interoperability

For this example, we have a CSV file structured as below:

CompanyCode,CompanyName,Country,Ticker
2377,China Steel Corp,TW,2002 TT
2726,Uni-President Enterprises Corp,TW,1216 TT

Next, we want to serialize this CSV into protocol buffer format. First, create a proto file and save it as company.proto:

syntax = "proto3";
package crilist;

message CompanyMap {
    string name = 1;
    int64 code = 2;
    string ticker = 4;
    string countryCode = 5;
}

message CompanyList {
    repeated CompanyMap company = 1;
}
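To get a feel for why this schema leads to compact output, here is a minimal sketch that hand-encodes one CompanyMap record following the protocol buffer wire format (a tag made of the field number and wire type, then a varint or a length-prefixed string). The helper names are hypothetical and exist only for illustration. The sketch handles non-negative integers only and always writes every field, whereas a real proto3 serializer omits fields that hold default values.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_int_field(field_number, value):
    """Varint field: the tag is (field_number << 3) | wire type 0."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number, text):
    """Length-delimited field: tag with wire type 2, byte length, UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


def encode_company(name, code, ticker, country_code):
    """Hand-encode one CompanyMap using the field numbers from company.proto."""
    return (encode_string_field(1, name)
            + encode_int_field(2, code)
            + encode_string_field(4, ticker)
            + encode_string_field(5, country_code))


if __name__ == "__main__":
    payload = encode_company("China Steel Corp", 2377, "2002 TT", "TW")
    print(payload.hex())
    print(len(payload), "bytes")
```

Note that the field names never appear in the output; only the field numbers do. The code field, for instance, takes three bytes in total (10 c9 12): one for the tag and two for the value 2377.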

Next, we need to generate Python classes from the proto file. If you don’t have the protocol buffer compiler (protoc) installed on your machine, please install it first.

Run this command to generate the classes:

protoc --proto_path=. --python_out=. company.proto

This produces a generated module named company_pb2.py.

Now for the part that reads the CSV file and converts it to protocol buffers:


import argparse
import pandas as pd
import company_pb2

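def saveToPB(df):
    company_list = company_pb2.CompanyList()
    for _, r in df.iterrows():
        # Add a new CompanyMap entry per CSV row; a single entry created
        # outside the loop would be overwritten on every iteration.
        company = company_list.company.add()
        # Column names follow the sample CSV header shown above.
        company.name = str(r['CompanyName'])
        company.code = int(r['CompanyCode'])
        company.ticker = str(r['Ticker'])
        company.countryCode = str(r['Country'])
    # Serialize the complete list once, overwriting any previous output.
    with open('output_pb', 'wb') as f:
        f.write(company_list.SerializeToString())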

def main():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('-i', '--finput',
                        help='file source to read.', default="input.csv")
    args = parser.parse_args()

    df = pd.read_csv(args.finput)
    saveToPB(df)

if __name__ == "__main__":
    main()
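If you would rather check the row-to-field mapping without pandas or the generated classes, the following sketch does the same conversion with the standard csv module. It assumes a comma-delimited file with the header shown earlier, and it uses plain dictionaries as a stand-in for CompanyMap; the function name rows_to_fields is made up for this example.

```python
import csv
import io

# Sample rows copied from the article; the comma-delimited layout is an assumption.
SAMPLE_CSV = """CompanyCode,CompanyName,Country,Ticker
2377,China Steel Corp,TW,2002 TT
2726,Uni-President Enterprises Corp,TW,1216 TT
"""


def rows_to_fields(csv_text):
    """Map each CSV row to the field names and types used by CompanyMap."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            "name": row["CompanyName"].strip(),
            "code": int(row["CompanyCode"]),  # int64 in the schema, so convert explicitly
            "ticker": row["Ticker"].strip(),
            "countryCode": row["Country"].strip(),
        })
    return records


if __name__ == "__main__":
    for record in rows_to_fields(SAMPLE_CSV):
        print(record)
```

Converting the code column with int() up front means a malformed row fails with a clear ValueError before anything is written to disk.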

The saveToPB script reads input.csv and saves it in protocol buffer format under the name output_pb. To read the serialized data back, you can use this Python script:


# first we need to import the generated classes

import company_pb2

def ListCompany():
    # Read the serialized bytes and parse them back into a CompanyList.
    data = company_pb2.CompanyList()
    with open('output_pb', 'rb') as f:
        data.ParseFromString(f.read())
    print(data)

def main():
    ListCompany()

if __name__ == '__main__':
    main()
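The reader above loads the whole file into a single CompanyList. Serialized protocol buffer messages do not mark where they end, so if you want to stream records one at a time, a common approach is to write a length prefix before each message. The sketch below shows that idea with a fixed 4-byte big-endian prefix, which is an assumption chosen for simplicity. The function names are hypothetical, and the payloads are stand-in byte strings where real code would pass the result of SerializeToString().

```python
import io
import struct


def write_framed(stream, payload):
    """Write one message as a 4-byte big-endian length followed by its bytes."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_framed(stream):
    """Yield payloads one at a time until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise EOFError("truncated length header")
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise EOFError("truncated message body")
        yield payload


if __name__ == "__main__":
    buffer = io.BytesIO()
    # Stand-in byte strings; in practice each would be one serialized CompanyMap.
    for payload in (b"first", b"second"):
        write_framed(buffer, payload)
    buffer.seek(0)
    for payload in read_framed(buffer):
        print(payload)
```

On the reading side, each yielded payload could then be passed to ParseFromString on a fresh CompanyMap, so only one record is held in memory at a time.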

I hope this gives you a clear idea of how to use protocol buffers from Python.

