How we enriched 7084 records for two professors and a PhD student at the Hong Kong University of Science and Technology (HKUST)

January 20, 2021

A few weeks ago, a student at the Hong Kong University of Science and Technology (HKUST) contacted me. The student represented a group of academics who needed to enrich a list of people with work experience and education history. This article shares how I carried out this profile enrichment exercise with the Proxycurl API.

The problem

Given a list of people, my job is to find their corresponding Professional Social Network profiles and enrich the list with work and education history.

As it turns out, the data that HKUST provided is outdated.

Resolving people to their Professional Social Network Profile

The list provided by HKUST contained general but identifying information about each person: first name, last name, name of the employer, and role in the organization. These fields are an exact match for the input parameters of Proxycurl's Profile General Resolution Endpoint.

To resolve these loose bits of information about a person to his/her Professional Social Network profile, I wrote the following Python function.

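```python
import sys

import httpx

# RETRY_COUNT and API_KEY are module-level settings that the source does not
# show; the values here are placeholders.
RETRY_COUNT = 3
API_KEY = 'your-api-key'


async def resolve_profile_url(first_name, last_name, title, country, city, coy_name, company_domain):
    last_exc = RuntimeError('no attempt was made')
    for _ in range(RETRY_COUNT):
        try:
            # Only the tail of this URL survives in the source; take the full
            # endpoint URL from Proxycurl's documentation.
            api_endpoint = '.../profile/resolve'
            # The header contents are lost in the source; bearer-token auth is assumed.
            headers = {'Authorization': 'Bearer ' + API_KEY}
            async with httpx.AsyncClient() as client:
                params = {
                    # The first key is lost in the source; 'company_domain' is assumed.
                    'company_domain': f"{coy_name} {company_domain}",
                    'title': title,
                    'first_name': first_name,
                    'last_name': last_name,
                    'location': f"{country} {city}",
                }
                # The timeout is an assumption: httpx defaults to 5 seconds,
                # which is shorter than the ~10-second average response time.
                resp = await client.get(api_endpoint, params=params, headers=headers, timeout=60)
                if resp.status_code != 200:
                    print(resp.status_code)
                assert resp.status_code == 200
                return resp.json()['url']
        except KeyboardInterrupt:
            sys.exit()
        except Exception as exc:
            # Remember the failure and retry.
            last_exc = exc
    raise last_exc
```

With the profile resolution code written, I iterated through the CSV list of people provided by HKUST and matched them to their Professional Social Network profiles.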
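The iteration over the CSV could look roughly like the sketch below. The column names and sample rows are hypothetical, since the layout of the HKUST file is not shown here, and `fake_resolve_profile_url` is a stand-in so the example runs without network access. In practice `resolve_profile_url` would be passed in its place.

```python
import asyncio
import csv
import io

# Hypothetical columns and rows; the real HKUST file layout is not shown.
SAMPLE_CSV = """id,first_name,last_name,title,country,city,coy_name,company_domain
1,Jane,Doe,Engineer,Hong Kong,Hong Kong,Example Ltd,example.com
2,John,Roe,Analyst,Singapore,Singapore,Sample Pte,sample.sg
"""

# Keyword arguments expected by resolve_profile_url.
RESOLVER_FIELDS = ('first_name', 'last_name', 'title', 'country',
                   'city', 'coy_name', 'company_domain')


async def fake_resolve_profile_url(**fields):
    """Stand-in resolver (hypothetical) that builds a URL from the name."""
    return f"https://example.com/in/{fields['first_name']}-{fields['last_name']}".lower()


async def resolve_all(csv_file, resolver):
    """Resolve every CSV row, returning (row id, profile URL or None) pairs."""
    results = []
    for row in csv.DictReader(csv_file):
        kwargs = {name: row[name] for name in RESOLVER_FIELDS}
        try:
            url = await resolver(**kwargs)
        except Exception:
            url = None  # no match after all retries; keep the row so ids stay aligned
        results.append((row['id'], url))
    return results


results = asyncio.run(resolve_all(io.StringIO(SAMPLE_CSV), fake_resolve_profile_url))
print(results)
```

Rows that fail to resolve are kept with a `None` URL, so the output can still be joined back to the input list by id.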

Bulk scraping Professional Social Network profiles

The next step is the easy one, and it is Proxycurl's core competency. Now that I have a list of Professional Social Network profile URLs, all I need to do is send them to [Proxycurl's Person Profile Endpoint](https://nubela.co/proxycurl/Professional Social Network).

As with the resolution endpoint, I wrote a function that takes a Professional Social Network profile URL and returns the profile as structured data.

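```python
import sys

import httpx

# RETRY_COUNT and API_KEY are module-level settings not shown in the source.


async def get_person_profile(url: str) -> dict:
    last_exc = RuntimeError('no attempt was made')
    for _ in range(RETRY_COUNT):
        try:
            # The endpoint URL is garbled in the source; take it from
            # Proxycurl's documentation.
            api_endpoint = '...'
            # The header contents are lost in the source; bearer-token auth is assumed.
            headers = {'Authorization': 'Bearer ' + API_KEY}
            async with httpx.AsyncClient() as client:
                params = {'url': url}
                # The timeout is an assumption: httpx defaults to 5 seconds,
                # which is shorter than the ~10-second average response time.
                resp = await client.get(api_endpoint, params=params, headers=headers, timeout=60)
                assert resp.status_code == 200
                return resp.json()
        except KeyboardInterrupt:
            sys.exit()
        except Exception as exc:
            # Remember the failure and retry.
            last_exc = exc
    raise last_exc
```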

Proxycurl API tips

You will notice that the functions I wrote are:

  • asynchronous
  • tolerant of unexpected exceptions with a default action of retry

The functions are asynchronous because each request takes an average of 10 seconds. To maximize throughput, I followed Proxycurl's best practice of sending concurrent requests: my script ran 100 workers in Python to send API requests asynchronously.
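A minimal sketch of such a worker pool follows. It assumes the standard `asyncio` module, which is a guess because the library is not named above. `fake_fetch`, `worker`, and `fetch_all` are hypothetical names, and `fake_fetch` stands in for a real API call such as `get_person_profile`.

```python
import asyncio

WORKER_COUNT = 100  # the worker count mentioned in the text


async def fake_fetch(url):
    """Stand-in for an API call; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return {'url': url}


async def worker(queue, results, fetch):
    """Pull URLs off the queue until cancelled, storing each result by URL."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        except Exception as exc:
            results[url] = exc  # record the failure instead of killing the worker
        finally:
            queue.task_done()


async def fetch_all(urls, fetch, worker_count=WORKER_COUNT):
    """Process all URLs with a fixed number of concurrent workers."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}
    workers = [asyncio.create_task(worker(queue, results, fetch))
               for _ in range(worker_count)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


urls = [f"https://example.com/in/person-{i}" for i in range(250)]
profiles = asyncio.run(fetch_all(urls, fake_fetch))
print(len(profiles))
```

A fixed pool caps the number of in-flight requests at the worker count, however long the input list is.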

Each request sent to Proxycurl's API is an on-demand scrape job, so there is a non-zero chance of a client-side error or a network error. When this happens, the right thing to do is always to retry.
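The retry-on-any-error pattern can also be factored into a small helper, sketched here. `with_retries` and `flaky_request` are hypothetical names, and the optional pause between attempts is an addition of this sketch, not something the functions above do.

```python
import asyncio


async def with_retries(make_request, retry_count=3, pause_seconds=0.0):
    """Await make_request() up to retry_count times; re-raise the last error."""
    last_exc = RuntimeError('retry_count must be at least 1')
    for attempt in range(retry_count):
        try:
            return await make_request()
        except Exception as exc:
            last_exc = exc
            # Optionally pause before the next attempt (not after the last one).
            if pause_seconds and attempt < retry_count - 1:
                await asyncio.sleep(pause_seconds)
    raise last_exc


attempts = []


async def flaky_request():
    """Simulated request that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError('simulated network error')
    return {'ok': True}


outcome = asyncio.run(with_retries(flaky_request, retry_count=5))
print(outcome, len(attempts))
```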

With these two tips, I can scrape the entire HKUST file in one go.

Massaging data for output

The team at HKUST needed the output in a specific (Excel) format. After iterating through the list to resolve Professional Social Network profiles and then fetching the corresponding profile data, I had the raw data I needed. All that remained was to massage the raw data into the CSV format that HKUST wanted.

And this is how I did it:

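```python
import csv


def massage_data_for_experience():
    # Right-hand sides shown as `...` are lost in the source and left as-is.
    counter = 0  # initial value assumed; the source only shows `counter += 1`
    with open(PATH_2_OUTPUT, 'r') as output_f:
        output_csv = csv.reader(output_f)  # reader type assumed
        for idx, output_row in enumerate(output_csv):
            output_id = ...
            profile = ...
            counter += 1

            # Find the matching row in the input list.
            with open(PATH_2_INPUT, 'r') as input_f:
                input_csv = csv.reader(input_f)  # reader type assumed
                for row in input_csv:
                    input_id = ...

                    if input_id == output_id:
                        # Write one output row per work experience.
                        for exp in profile['experiences']:
                            coy_name = ...
                            title = ...
                            employment_type = ...
                            location = ...
                            starts_at = ...
                            ends_at = ...
                            description = ...
                            profile_url = ...
                            with open(PATH_2_EXP, 'a+', newline='') as f:
                                writer = csv.writer(f)
                                writer.writerow([output_id, coy_name, title, employment_type, location,
                                                 starts_at, ends_at, description, profile_url])
            print(f"{counter}: Done.")
```

That produced the final result.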
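As a self-contained illustration of the flattening step in `massage_data_for_experience`, the sketch below writes one CSV row per work experience. The sample data, the profile JSON keys (`company`, `starts_at`, `profile_url`, and so on), and the helper name `experience_rows` are all assumptions made for illustration. It also swaps the nested file scan for a dictionary lookup by id, which is a different technique from the original.

```python
import csv
import io

# Hypothetical scraped profiles keyed by record id; the JSON keys are assumed.
profiles_by_id = {
    '1': {
        'profile_url': 'https://example.com/in/jane-doe',
        'experiences': [
            {'company': 'Example Ltd', 'title': 'Engineer',
             'employment_type': 'Full-time', 'location': 'Hong Kong',
             'starts_at': '2018-01', 'ends_at': None,
             'description': 'Builds things.'},
            {'company': 'Sample Pte', 'title': 'Analyst',
             'location': 'Singapore', 'starts_at': '2015-06',
             'ends_at': '2017-12', 'description': None},
        ],
    },
}
input_ids = ['1', '2']  # ids from the input list; '2' has no scraped profile


def experience_rows(input_ids, profiles_by_id):
    """Yield one flat row per work experience for every id that has a profile."""
    for person_id in input_ids:
        profile = profiles_by_id.get(person_id)
        if profile is None:
            continue  # nothing was scraped for this person
        for exp in profile.get('experiences') or []:
            yield [person_id, exp.get('company'), exp.get('title'),
                   exp.get('employment_type'), exp.get('location'),
                   exp.get('starts_at'), exp.get('ends_at'),
                   exp.get('description'), profile.get('profile_url')]


buffer = io.StringIO()  # stands in for the output file
csv.writer(buffer).writerows(experience_rows(input_ids, profiles_by_id))
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

Missing values are written as empty cells, so every row has the same nine columns.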

Do you have an enrichment task?

Hey, I would love to help you out. Let me know if you have a task at hand that requires bulk Professional Social Network data scraping. You can shoot us an email at hello@nubela.co.

And if you love reading weekly anecdotes about how we are solving business problems with our data tools, you can subscribe to our email list :)